Intermittent service disruptions in Automated Software Integration

Incident Report for Fluid Attacks

Postmortem

Impact

An unknown number of users encountered difficulties with the Automated Software Integration because of machine scaling issues in the CI. The issue started on 2024-02-06 at 08:00 (UTC-5) and was proactively discovered 1.1 days later (TTD) by the product team during their regular workflow. The problem was resolved in 2.8 hours (TTF), resulting in a total impact of 1.2 days (TTR) [1].
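
The relationship between the three figures above can be checked with a short calculation. This is only an illustrative sketch: it assumes that TTR is the sum of TTD and TTF, and that the reported values are rounded to one decimal place. The variable names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Reported start of the incident (UTC-5), taken from the Impact section.
start = datetime(2024, 2, 6, 8, 0, tzinfo=timezone(timedelta(hours=-5)))

ttd = timedelta(days=1.1)   # time to detect, as reported (rounded)
ttf = timedelta(hours=2.8)  # time to fix, as reported (rounded)

# Assumption: total impact (TTR) is detection time plus fix time.
ttr = ttd + ttf
ttr_days = ttr / timedelta(days=1)

print(f"Approximate detection time: {start + ttd}")
print(f"TTR: {ttr_days:.1f} days")
```

Because the inputs are rounded, the derived timestamps are approximate and may not match the update log exactly.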

Cause

The workers' disks were nearly full, so even a slight increase in the size of the workers' operating systems caused issues.
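
A condition like this can in principle be caught early with a free-space check on each worker. The following is a minimal sketch, not the tooling used by the team: the function names and the 10% threshold are assumptions chosen for illustration.

```python
import shutil

def low_on_space(total_bytes: int, free_bytes: int, min_free: float = 0.10) -> bool:
    """Return True when the free fraction of a disk is below min_free."""
    return (free_bytes / total_bytes) < min_free

def path_is_low(path: str = "/", min_free: float = 0.10) -> bool:
    """Check the filesystem that contains `path` (hypothetical helper)."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return low_on_space(usage.total, usage.free, min_free)

if __name__ == "__main__":
    # The 10% threshold is an assumed value, not one taken from the report.
    print("low on space:", path_is_low("/"))
```

Running such a check periodically and alerting on a `True` result would flag a nearly full disk before a small increase in operating system size exhausts it.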

Solution

The size of the workers' disks was increased [2].

Conclusion

The workers' operating systems gradually grew until they exceeded the available disk capacity, which caused the issue. Increasing the disk size by 50% mitigates future occurrences. IMPOSSIBLE_TO_TEST
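
To give a sense of the headroom a 50% increase provides, the sketch below computes the free fraction of a disk before and after resizing. The report does not state the actual disk sizes or usage, so the 100 GB disk at 95% usage is a purely hypothetical input, and the function name is illustrative.

```python
def free_fraction_after_resize(total_gb: float, used_gb: float, growth: float = 0.5) -> float:
    """Free fraction of a disk after its capacity grows by `growth` (0.5 = 50%)."""
    new_total = total_gb * (1 + growth)
    return (new_total - used_gb) / new_total

# Hypothetical example: a 100 GB disk with 95 GB in use.
before = (100 - 95) / 100
after = free_fraction_after_resize(100, 95)

print(f"free before resize: {before:.1%}")
print(f"free after resize:  {after:.1%}")
```

Under these assumed numbers, the free share rises from about 5% to a little over a third of the disk, which leaves room for further gradual growth of the operating system.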

Posted Feb 08, 2024 - 14:25 GMT-05:00

Resolved

The incident has been resolved, and the CI is now operating normally.

Posted Feb 07, 2024 - 17:22 GMT-05:00

Update

Still investigating the root cause.

Posted Feb 07, 2024 - 15:19 GMT-05:00

Update

The assigned developer is investigating the root cause.

Posted Feb 07, 2024 - 13:23 GMT-05:00

Identified

The CI is experiencing slowdowns and intermittent unavailability.

Posted Feb 07, 2024 - 11:45 GMT-05:00

This incident affected: Dependencies (AWS ec2-us-east-1).