
2020

All services outage

At 19:55 UTC, all services became unresponsive. The DevOps team were already in a call, and immediately started to investigate.

Postgres was running at 100% CPU due to a VACUUM operation, which caused every service that depended on it to stop working. The high CPU load left the host unresponsive and it shut down; Linode's Lassie watchdog noticed this and triggered a restart.
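The damage a runaway VACUUM can do is bounded by PostgreSQL's cost-based vacuum delay, which makes vacuum work yield CPU and I/O to other queries. A sketch of the relevant `postgresql.conf` knobs (the values shown are the PostgreSQL defaults, used here for illustration, not the settings we were actually running):

```ini
# Cost-based vacuum throttling (illustrative values, not our production config).
vacuum_cost_limit = 200             # work credits accumulated before a vacuum pauses
autovacuum_vacuum_cost_delay = 2ms  # how long autovacuum sleeps once the limit is hit
autovacuum_max_workers = 3          # cap on concurrent autovacuum workers
```

Lowering `vacuum_cost_limit` or raising the delay slows vacuuming down in exchange for keeping the host responsive.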

The system did not recover gracefully from this restart: numerous core services reported errors, so we had to manually restart them using Lens to get things working again.

Postgres connection surge

At 13:24 UTC, we noticed the bot was unable to apply infractions and pythondiscord.com was unavailable. The DevOps team started to investigate.

We discovered that Postgres was refusing new connections because it had reached its limit of 100 clients (the default `max_connections`). This made it unavailable to every service that depended on it.
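A surge like this can be caught before the limit is hit by comparing `pg_stat_activity` against `max_connections`. A minimal monitoring sketch, assuming some job runs the query below periodically; the 80% warning threshold and the wiring around the query are illustrative assumptions, not part of our setup:

```python
# Sketch: check Postgres connection headroom before max_connections is exhausted.
# The SQL uses the standard pg_stat_activity view and current_setting();
# how the query is executed and alerted on is left to the monitoring job.

CONNECTION_QUERY = """
SELECT count(*) AS used,
       current_setting('max_connections')::int AS max_conn
FROM pg_stat_activity;
"""

def connection_headroom(used: int, max_conn: int,
                        warn_ratio: float = 0.8) -> tuple[int, bool]:
    """Return (free connection slots, whether usage crossed the warning threshold)."""
    free = max_conn - used
    near_limit = used >= max_conn * warn_ratio
    return free, near_limit

# With the numbers from this incident: 100 clients on a default
# max_connections of 100 leaves no headroom at all.
free, near_limit = connection_headroom(used=100, max_conn=100)
print(free, near_limit)  # 0 True
```

A connection pooler such as PgBouncer is the usual way to keep client counts safely below the limit when many services share one database.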

Ultimately this was resolved by taking down Postgres, remounting the associated volume, and bringing it back up again.