At least two groups had root clones queued indefinitely after checking or unchecking the Secure Access option (UTC-5 23-09-04 20:28 to 23-09-14 08:27: 9.5 days, time to recover). The incident was detected proactively (at UTC-5 23-09-06 13:21: 1.8 days, time to detect) by an agent from the experience team, who noticed that enabling or disabling Secure Access on a root left the clone on hold indefinitely and reported it through our help desk [1].
We were implementing a "Secure Access" model for our customers, in which client roots are cloned with AWS Batch instead of Kubernetes [2]. When a clone was triggered from the graphical interface under the new architecture, a resource allocation error occurred in Batch because some resources had been defined more than once, and the clone remained queued indefinitely.
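As an illustration of the failure mode described above, the following is a minimal sketch of a pre-submission check that rejects a job definition in which the same resource is declared twice. It assumes the duplication appeared in a list shaped like the `resourceRequirements` field of an AWS Batch job definition (entries with `type` and `value`); the report does not state where the duplicates actually lived, and the function name `find_duplicate_resources` and the sample values are hypothetical.

```python
from collections import Counter


def find_duplicate_resources(requirements):
    """Return the resource types declared more than once.

    `requirements` is assumed to look like AWS Batch's resourceRequirements:
    [{"type": "VCPU", "value": "2"}, {"type": "MEMORY", "value": "4096"}]
    """
    counts = Counter(item["type"] for item in requirements)
    return sorted(name for name, seen in counts.items() if seen > 1)


# Hypothetical definition in which MEMORY is declared twice.
requirements = [
    {"type": "VCPU", "value": "2"},
    {"type": "MEMORY", "value": "4096"},
    {"type": "MEMORY", "value": "8192"},
]

duplicates = find_duplicate_resources(requirements)
if duplicates:
    # Failing before submission surfaces the problem immediately,
    # instead of leaving a job waiting in the queue.
    print(f"Duplicated resource definitions: {duplicates}")
```

A check of this kind could run in CI or just before the job is submitted, so that a malformed definition is reported as an error rather than as a clone that never starts.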
For the affected clients, the clones were unlocked manually. Then triggers were configured to run the clones after the Secure Access option is checked or unchecked, resources were allocated, the code was refactored to remove the duplications, and exceptions were handled [3][4][5].
The problem was caused by a resource allocation error introduced when changing the infrastructure used to clone client repositories. IMPOSSIBLE_TO_TEST