Developers should use the product they work on. This is called dogfooding, or eating your own dog food. Using the product as a real user is very different from exercising it during development. We understand that at Metis, so we use Metis to debug our own platform.
We configured Metis to monitor all our databases, provide insights, and automatically track every change we make. We periodically review the dashboards and look for anomalies. Metis tracks many different characteristics, one of them being the number of open Postgres connections. We track this metric because unused connections hurt database performance: they consume resources and bring no value. They can also lead to the error “pg: too many connections”. Generally, we should avoid keeping too many open connections, as we explained in our PostgreSQL Recommended Configuration article.
The number of active connections should normally follow a weekly pattern, for example rising during the day and falling overnight. The exact shape depends on the characteristics of your application.
One day we started a deployment, and things started to break. Let’s see what happened and how Metis helped us.
Connections Going Up After The Deployment
We were in the process of releasing a new version of the application. The new version changed how we managed database connections, and as a result it opened many new connections every minute. However, those connections were not released properly, so memory usage kept climbing and free memory shrank over time.
Metis showed both metrics properly. You can see that on the observability dashboard:
The screenshots were taken after we fixed the issue. With Metis you can check Postgres open connections:
You can also check Postgres free memory and see that it went down and then back up after deploying the fix:
These metrics pointed us in the right direction. We modified the code to close connections as soon as they are no longer needed, and this fixed the problem.
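To illustrate the kind of bug and fix described above, here is a minimal sketch in Python. It is not our actual code: `FakePool`, `leaky_query`, `connection`, and `safe_query` are hypothetical names, and the pool only counts connections instead of talking to a real database. The point is the pattern: release the connection in a `finally` block so it is returned even when the query fails.

```python
from contextlib import contextmanager


class FakePool:
    """Hypothetical stand-in for a connection pool; it only counts connections."""

    def __init__(self):
        self.open_connections = 0

    def acquire(self):
        self.open_connections += 1
        return object()  # placeholder for a real connection handle

    def release(self, conn):
        self.open_connections -= 1


def leaky_query(pool, fail=False):
    """The failure mode: a connection is acquired on every call and never released."""
    conn = pool.acquire()
    if fail:
        raise RuntimeError("query failed")
    return "result"


@contextmanager
def connection(pool):
    """Hand out a connection and always release it, even if the caller raises."""
    conn = pool.acquire()
    try:
        yield conn
    finally:
        pool.release(conn)


def safe_query(pool, fail=False):
    """The fixed pattern: the connection lives only as long as the `with` block."""
    with connection(pool) as conn:
        if fail:
            raise RuntimeError("query failed")
        return "result"
```

Calling `leaky_query` in a loop makes `open_connections` grow by one per call, while `safe_query` leaves it at zero whether or not the query raises. Real drivers and pools usually offer their own context managers for this, so a wrapper like `connection` may not be needed in practice.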
No Customer Impact
After the issue, we ran a post-mortem. We wanted to understand the impact of the issue, whether any corrective actions were needed, and whether we could have identified the problem with other tools.
We found that customers were not affected. We caught the issue before it reached production customers, so there was no need to fix any production data.
However, we wouldn’t have caught the issue this easily with AWS metrics alone. AWS shows the number of connections but doesn’t show their state. We wouldn’t have known whether these connections were actually in use by the database or were some kind of “zombie” connections. Metis, on the other hand, clearly showed that there were many active connections and that this was the problem.
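For readers who want to look at connection states by hand, the sketch below groups connections by the `state` column of PostgreSQL's standard `pg_stat_activity` view. This is not how Metis collects the metric; it is only an illustration. The helper names `connections_by_state` and `states_over` are hypothetical, and the code assumes a Python DB-API cursor (for example one obtained from psycopg2).

```python
# pg_stat_activity is a standard PostgreSQL view; `state` is NULL for some
# background processes, so we map NULL to 'unknown'.
STATE_QUERY = """
SELECT COALESCE(state, 'unknown') AS state, count(*) AS connections
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY 1
"""


def connections_by_state(cursor):
    """Return {state: connection_count} using any DB-API compatible cursor."""
    cursor.execute(STATE_QUERY)
    return {state: count for state, count in cursor.fetchall()}


def states_over(by_state, threshold):
    """Return only the states whose connection count exceeds the threshold."""
    return {state: n for state, n in by_state.items() if n > threshold}
```

With a real cursor, `states_over(connections_by_state(cur), 50)` would list the states holding more than 50 connections. The threshold of 50 is an arbitrary example and should be chosen relative to your `max_connections` setting.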
We are positive that we wouldn’t have fixed the issue this easily without Metis in place. That’s a great example of dogfooding.
Dogfooding is important. It lets developers understand the product better and build things that users need. In our case, Metis protected us from breaking production data and affecting our customers. This is truly an example of our first core value: prevent bad code from reaching production.