Git root remaining in Queued state after checking/unchecking the Secure Access option
Incident Report for Fluid Attacks
Postmortem

Impact

At least two groups had root clones queued indefinitely after checking or unchecking the Secure Access option (UTC-5 23-09-04 20:28 to 23-09-14 08:27: 9.5 days -time to recover-). The incident was detected proactively (at UTC-5 23-09-06 13:21: 1.8 days -time to detect-) by an agent from the experience team, who noticed that cloning stayed on hold indefinitely whenever secure access was enabled or disabled on a root, and reported it through our help desk [1].

Cause

We were implementing a "Secure Access" model for our customers, in which the cloning of client roots would be performed with AWS Batch instead of Kubernetes [2]. When a clone was triggered from the graphical interface under the new architecture, a resource allocation error occurred in Batch because some resources were duplicated in their definitions. As a result, the clone remained queued indefinitely.
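As an illustration of the kind of check that could flag duplicated resource definitions before a job is submitted, here is a minimal sketch. It assumes the resources are expressed as a list of type/value entries, loosely modeled on the shape of AWS Batch resource requirements. The function name and the sample data are hypothetical and are not taken from the incident or from our codebase.

```python
from collections import Counter


def find_duplicate_resource_types(requirements):
    """Return the resource types declared more than once, sorted by name."""
    counts = Counter(req["type"] for req in requirements)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical resource list for a clone job (illustrative values only).
requirements = [
    {"type": "VCPU", "value": "2"},
    {"type": "MEMORY", "value": "4096"},
    {"type": "VCPU", "value": "2"},  # same type declared a second time
]

duplicates = find_duplicate_resource_types(requirements)
if duplicates:
    print("Duplicated resource types:", ", ".join(duplicates))
```

A check like this could run before submission so that a malformed definition fails fast instead of leaving the job waiting in the queue.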

Solution

For the affected clients, the clones were unlocked manually. Triggers were then configured to execute the clones after checking or unchecking the Secure Access option, resources were allocated, the code was refactored to remove the duplications, and exceptions were handled [3][4][5].

Conclusion

The problem was caused by a resource allocation error when changing the infrastructure for cloning client repositories. IMPOSSIBLE_TO_TEST

Posted Sep 18, 2023 - 17:13 GMT-05:00

Resolved
The engineering team has solved the problem and the cloning is working normally.
Posted Sep 15, 2023 - 12:22 GMT-05:00
Identified
Some cases have been detected where root cloning gets stuck after secure access is enabled or disabled.
Posted Sep 06, 2023 - 18:21 GMT-05:00
This incident affected: Web.