Intermittent service disruptions in Automated Software Integration
Incident Report for Fluid Attacks
Postmortem

Impact

An unknown number of users encountered difficulties with the Automated Software Integration due to machine scaling issues in the CI. The issue started on 2024-02-06 at 08:00 (UTC-5) and was proactively discovered 1.1 days later (TTD) by the product team during their regular workflow. The problem was fixed in 2.8 hours (TTF), resulting in a total impact of 1.2 days (TTR) [1].

Cause

The workers' disks were nearly full, so even a slight increase in the size of the workers' operating systems caused failures.
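
As a minimal sketch (not part of the original report), a check like the following could flag workers whose disks are close to capacity before a small growth in the OS image tips them over; the 90% threshold and the mount point are assumptions:

    import shutil

    # Hypothetical threshold; adjust to the fleet's actual headroom policy.
    THRESHOLD = 0.90

    def disk_nearly_full(path: str = "/") -> bool:
        # shutil.disk_usage returns total, used, and free bytes
        # for the filesystem containing `path`.
        usage = shutil.disk_usage(path)
        return usage.used / usage.total >= THRESHOLD

    if __name__ == "__main__":
        if disk_nearly_full():
            print("Warning: worker disk usage is at or above 90% of capacity.")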

Solution

The size of the workers' disks was increased [2].
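
For illustration only, a disk increase of this kind on AWS EC2 (the affected dependency) could be scripted roughly as below; the volume ID, the 50% growth factor, and the use of boto3 are assumptions, not the actual change referenced in [2]:

    import boto3

    # Assumed region and volume ID, for illustration only.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    volume_id = "vol-0123456789abcdef0"

    # Read the current size (GiB) and request a volume 50% larger.
    volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    new_size = int(volume["Size"] * 1.5)
    ec2.modify_volume(VolumeId=volume_id, Size=new_size)

    # Note: once the volume modification completes, the filesystem on the
    # worker still has to be grown (e.g., growpart plus resize2fs) to use
    # the new space.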

Conclusion

Gradually, the workers' operating systems grew beyond the available disk capacity, causing the issue. Increasing the disk size by 50% should prevent future occurrences. This fix is impossible to test preemptively.

Posted Feb 08, 2024 - 14:25 GMT-05:00

Resolved
The incident has been resolved, and the CI is now operating normally.
Posted Feb 07, 2024 - 17:22 GMT-05:00
Update
Still investigating the root cause.
Posted Feb 07, 2024 - 15:19 GMT-05:00
Update
The assigned developer is investigating the root cause.
Posted Feb 07, 2024 - 13:23 GMT-05:00
Identified
The CI is experiencing slowdowns and intermittent unavailability.
Posted Feb 07, 2024 - 11:45 GMT-05:00
This incident affected: Dependencies (AWS ec2-us-east-1).