Impact
An unknown number of users experienced extended response times on Fluid Attacks' Platform (from UTC-5 23-11-09 07:45 to 23-11-09 11:10 | Time to recover was 3.3 hours). The incident was detected proactively (at UTC-5 23-11-09 07:54 | Time to detect was 9 minutes) by one of our monitoring tools and by staff members, which reported web response times above 5 seconds in this component.
Cause
Our API received an abnormally high number of requests. The task handling them did not scale properly, which degraded the API's overall performance and affected our Platform and Agent.
Solution
The engineering team reduced the load of the operation by deactivating some tasks, such as indicators, while it investigated and optimized this process [1][2].
Conclusion
The operation had run with that configuration for some time, but the abnormal load on this task brought the problem to light. The team is now adding performance tests for API operations in scenarios where a large amount of data is loaded, and improving observability [3]. PERFORMANCE_ERROR < MISSING_TEST
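As an illustration only, a performance test of the kind described in the conclusion could look roughly like the sketch below. It is not the team's actual test suite: `load_large_dataset` is a hypothetical stand-in for an API operation that loads a lot of data, `within_budget` is an assumed helper name, and the 5-second budget is borrowed from the response-time threshold mentioned under Impact.

```python
import time
from typing import Callable

# Assumed budget, taken from the 5-second threshold cited in the report.
LATENCY_BUDGET_SECONDS = 5.0


def measure_latency(operation: Callable[[], object]) -> float:
    """Return the wall-clock seconds a single call to `operation` takes."""
    start = time.perf_counter()
    operation()
    return time.perf_counter() - start


def within_budget(
    operation: Callable[[], object],
    budget: float = LATENCY_BUDGET_SECONDS,
    runs: int = 3,
) -> bool:
    """Run `operation` several times and check the slowest run against the budget."""
    worst = max(measure_latency(operation) for _ in range(runs))
    return worst <= budget


def load_large_dataset(size: int) -> int:
    """Hypothetical stand-in for an API operation that loads many records."""
    records = [{"id": i, "value": i * 2} for i in range(size)]
    return sum(record["value"] for record in records)


def test_large_dataset_stays_within_budget() -> None:
    # A test runner such as pytest would collect this function by its name.
    assert within_budget(lambda: load_large_dataset(100_000))
```

Checking the slowest of several runs rather than the average is a deliberately conservative choice in this sketch, since a single slow response is what users and monitoring would notice.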