At least two groups had root clones queued indefinitely after checking or unchecking the Secure Access option. The issue started on UTC-5 23-09-04 20:28 and was proactively discovered 1.6 days (TTD) later by an agent from the experience team, who noticed that enabling or disabling Secure Access on a root left the cloning on hold indefinitely and reported it through our help desk [1]. The problem was resolved in 7.8 days (TTF), resulting in a total impact of 9.5 days (TTR).
We were implementing a "Secure Access" model for our customers, in which client roots would be cloned with AWS Batch instead of Kubernetes [2]. When a clone was triggered from the graphical interface under the new architecture, a resource allocation error occurred in Batch because some resources were duplicated in their definition, which left the clone queued indefinitely.
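The failure mode above (a resource defined more than once) can be caught before a job is submitted. The following is a minimal sketch, not the code involved in the incident. It assumes the duplication took the form of repeated entries in a list shaped like the `resourceRequirements` field of an AWS Batch container definition. The function name `find_duplicate_resources` and the sample values are hypothetical.

```python
from collections import Counter


def find_duplicate_resources(resource_requirements):
    """Return the resource types that are declared more than once, sorted."""
    counts = Counter(item["type"] for item in resource_requirements)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical fragment of a job definition: MEMORY is declared twice.
requirements = [
    {"type": "VCPU", "value": "2"},
    {"type": "MEMORY", "value": "4096"},
    {"type": "MEMORY", "value": "8192"},
]

duplicates = find_duplicate_resources(requirements)
if duplicates:
    # Failing fast here avoids submitting a job that may never leave the queue.
    print(f"Duplicated resource types: {duplicates}")
```

A check of this kind could run wherever the job definition is assembled, so that a duplicated entry raises a visible error instead of leaving the clone waiting in the queue.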
For the affected clients, the clones were unlocked manually. Triggers were then configured to execute the clones after checking or unchecking the Secure Access option, resources were allocated, the code was refactored to remove the duplications, and exceptions were handled [3][4][5].
The problem was caused by a resource allocation error when changing the infrastructure for cloning client repositories. IMPOSSIBLE_TO_TEST