Around 7am PT, one of our app server nodes (and alas, also our primary redis server) started exhibiting average network (ping) latency of several tens of ms — spiking to >100ms — to our DB master and other nodes in the cluster.
We have removed the app server from use, and failed over to replicas as of 10 am PT.
We are bringing on additional capacity to service the backlog as quickly as we can.
We’ll update this page as we make more progress.
Update: we are in communications with our infrastructure provider to get more information on the root cause of this situation.
Update 5:15pm: we have restored capacity and mostly drained the backlog of builds. We continue communication with our infrastructure provider to understand root cause.