At least 4 organizations experienced delays when cloning their repositories (UTC-5 23-08-29 17:14 to 23-08-31 13:30: about 2 days -time to recover-). The incident was detected reactively (at UTC-5 23-08-31 11:36: about 2 days -time to detect-) by several users who noticed multiple cloning issues and reported them through our help team [1].
The merge request at [2] caused the servers that process these jobs to send requests to a non-existent SQS queue. The resulting exception left the server unresponsive and unable to process tasks, so pending tasks accumulated.
The engineering team reverted the commit that introduced the problem [3].
The server that manages these operations must handle exceptions so that a single failure neither freezes the system nor blocks other processes. UNHANDLED_EXCEPTIONS
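
The conclusion above can be illustrated with a minimal sketch in Python. It is not the code of the affected server: `run_worker`, `QueueNotFoundError`, `clone_ok`, and `send_to_missing_queue` are hypothetical names chosen for illustration, and the in-memory deque only stands in for the real task queue. The idea shown is to catch and log the failure of each task individually so that the loop keeps draining the remaining work.

```python
import logging
from collections import deque
from typing import Callable, Deque, List, Tuple

logger = logging.getLogger("worker")

# Hypothetical error standing in for a request sent to a missing queue.
class QueueNotFoundError(Exception):
    pass

# A task is a name plus a zero-argument callable that does the work.
Task = Tuple[str, Callable[[], None]]


def run_worker(tasks: Deque[Task]) -> Tuple[List[str], List[str]]:
    """Process every pending task; one failing task never stops the loop."""
    done: List[str] = []
    failed: List[str] = []
    while tasks:
        name, action = tasks.popleft()
        try:
            action()
        except Exception:
            # Record the failure with its traceback, then move on.
            logger.exception("Task %s failed", name)
            failed.append(name)
        else:
            done.append(name)
    return done, failed


def clone_ok() -> None:
    """Hypothetical task that succeeds."""


def send_to_missing_queue() -> None:
    """Hypothetical task that fails like a call to a non-existent queue."""
    raise QueueNotFoundError("queue does not exist")


if __name__ == "__main__":
    pending: Deque[Task] = deque(
        [
            ("clone-a", clone_ok),
            ("notify", send_to_missing_queue),
            ("clone-b", clone_ok),
        ]
    )
    done, failed = run_worker(pending)
    print("done:", done)
    print("failed:", failed)
```

In this sketch the failed task is only logged and listed. A real worker would likely also alert on repeated failures or move the task to a retry or dead-letter queue, which would help detect this kind of incident proactively instead of through user reports.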