Impact
An unknown number of groups experienced failures in the repository cloning process (from UTC-5 23-11-07 05:53 to 23-11-07 12:47 | Time to recover was 4.08 hours). The incident was discovered proactively (at UTC-5 23-11-07 08:46 | Time to detect was 2.8 hours) by a member of the Fluid Attacks team [1], who encountered a "No space left on device" message in several groups on the Platform.
Cause
The type of virtual server instance that Fluid Attacks uses to execute some tasks was changed [2]. The new instances that processed repository cloning have less storage space. Because of how the cloning process was architected, the cloning tasks ran out of disk space on these instances, the process failed, and the error message was displayed.
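One way to surface this class of failure earlier is a free-space check before a clone starts. The following is a minimal sketch, not the actual implementation: the function name `has_room_for_clone`, the `required_bytes` estimate, and the `safety_factor` margin are all hypothetical, and only `shutil.disk_usage` comes from the Python standard library.

```python
import shutil


def has_room_for_clone(path: str, required_bytes: int, safety_factor: float = 1.5) -> bool:
    """Return True if the filesystem holding `path` has enough free space.

    `required_bytes` is an assumed estimate of the repository size, and
    `safety_factor` is an assumed margin for temporary files.
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes * safety_factor
```

A task could call this check first and fail with an explicit message, or be rescheduled, instead of stopping midway with "No space left on device".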
Solution
The architecture of the cloning process was modified to reduce the number of tasks executed concurrently per instance [3].
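The idea of capping concurrent tasks per instance can be illustrated with a bounded worker pool. This is only a sketch under stated assumptions: the report does not describe how the limit was actually configured, and `run_bounded` and `max_concurrent` are hypothetical names. Each callable stands in for one cloning task.

```python
from concurrent.futures import ThreadPoolExecutor


def run_bounded(tasks, max_concurrent):
    """Run callables with at most `max_concurrent` executing at once.

    Results are returned in the same order as `tasks`.
    """
    # The pool size is the cap: extra tasks wait in the queue
    # until a worker becomes free.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]
```

With a lower `max_concurrent`, fewer clones hold disk space at the same moment, so peak storage use per instance drops at the cost of longer total run time.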
Conclusion
There is currently no test for this part of the infrastructure because it cannot be run locally, and this kind of change cannot be tested before it reaches production [4]. The product team is now working on changes to improve the cluster's robustness, reproducibility, and observability [5][6]. INFRASTRUCTURE_ERROR < IMPOSSIBLE_TO_TEST