Git roots remaining in Queued state after checking/unchecking the Secure Access option

Incident Report for Fluid Attacks

Postmortem

Impact

At least two groups had root clones queued indefinitely after checking or unchecking the Secure Access option. The issue started on 2023-09-04 at 20:28 (UTC-5) and was proactively discovered 1.6 days later (TTD) by an agent from the experience team, who noticed that enabling or disabling Secure Access on a root left the cloning on hold indefinitely and reported it through our help desk [1]. The problem was resolved in 7.8 days (TTF), resulting in a total impact of 9.5 days (TTR).

Cause

We were implementing a "Secure Access" model for our customers, in which client roots would be cloned with AWS Batch instead of Kubernetes [2]. When a clone was launched from the graphical interface under the new architecture, a resource allocation error occurred in Batch because some resources had been defined more than once, which left the clone queued indefinitely.
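
As an illustration only, the kind of duplication described above can be caught before a job is submitted. The report does not say which resources were duplicated or where they were defined, so this minimal sketch assumes the duplication would show up as repeated entries in a Batch-style list of resource requirements (items with a "type" and a "value"). The function name and the sample data are hypothetical.

```python
from collections import Counter


def find_duplicate_resources(resource_requirements):
    """Return the resource types that are declared more than once."""
    counts = Counter(req["type"] for req in resource_requirements)
    return sorted(rtype for rtype, n in counts.items() if n > 1)


# Hypothetical fragment of a job definition (assumption, not from the report).
requirements = [
    {"type": "VCPU", "value": "2"},
    {"type": "MEMORY", "value": "4096"},
    {"type": "VCPU", "value": "4"},  # declared a second time by mistake
]

duplicates = find_duplicate_resources(requirements)
if duplicates:
    print(f"Duplicated resource types: {duplicates}")
```

A check like this could run in a test or a pre-submit step, so that a conflicting definition fails fast instead of leaving a job waiting in the queue.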

Solution

For the affected clients, the clones were unlocked manually. Triggers were configured to run the clones after the Secure Access option is checked or unchecked, resources were allocated, the code was refactored to remove the duplicated definitions, and exceptions were handled [3][4][5].
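
The trigger and exception handling mentioned above might look roughly like the following sketch. It is not the actual fix: `on_secure_access_changed`, `CloneQueueError`, and the `submit_clone` callback are hypothetical names, used only to show a clone being queued on every toggle while a failed submission is logged instead of being silently lost.

```python
import logging

logger = logging.getLogger(__name__)


class CloneQueueError(Exception):
    """Hypothetical error raised when a clone job cannot be queued."""


def on_secure_access_changed(root_id, secure_access, submit_clone):
    """Queue a fresh clone whenever the Secure Access option is toggled.

    `submit_clone` is an assumed callback that sends the job to the
    cloning backend. Returns True if the clone was queued, False otherwise.
    """
    try:
        submit_clone(root_id, secure_access)
    except CloneQueueError:
        # Log the failure so it can be retried or escalated.
        logger.exception("Could not queue clone for root %s", root_id)
        return False
    return True
```

Passing the submit function as a parameter keeps the sketch independent of any particular backend and makes the failure path easy to exercise in a test.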

Conclusion

The problem was caused by a resource allocation error introduced when changing the infrastructure used to clone client repositories. IMPOSSIBLE_TO_TEST

Posted Sep 18, 2023 - 17:13 GMT-05:00

Resolved

The engineering team has solved the problem and cloning is working normally.

Posted Sep 15, 2023 - 12:22 GMT-05:00

Identified

Some cases have been detected where root cloning got stuck after enabling or disabling secure access. Workaround: request the release of the repository from the cloning queue through help@fluidattacks.com.

Posted Sep 06, 2023 - 18:21 GMT-05:00

This incident affected: Web.