How to troubleshoot - overview
Keep in mind that high CPU usage can have many unrelated causes: an expensive operation, memory issues, false sharing, and more. We saw several of these in the previous part of this series.
We need to check multiple things and verify our assumptions at each stage. It’s worth thinking about recent deployments, weekly patterns, differences in running configuration, locale-specific details, background processes (CRON jobs, daemons, autovacuum), and other background activity.
The rest of this post walks through the checks we can run.
Active and idle connections
The first thing to check is how many connections the database has.
Number of connections
We can use the following query:
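A query along these lines, based on the `pg_stat_activity` view, does the job (the grouping by state is my choice; you can also just count rows):

```sql
-- Count connections grouped by state (active, idle, idle in transaction, ...)
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;
```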
We can see the number of connections and how many of them are idle. You can also check the maximum number of allowed connections with the following query:
This returns the configured connection limit.
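The limit can be read with `SHOW` (the same value is available via `current_setting('max_connections')`):

```sql
SHOW max_connections;
```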
Typically, the default is 100 connections. You can change it to match your needs, but that requires restarting your PostgreSQL instance. The more connections you allow, the higher the memory usage, so tune it to sit just above what you actually need.
Connections per database
Next thing to check is which databases are used by the connections. You can use the following query:
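One way to break connections down per database, again using `pg_stat_activity`:

```sql
-- Count connections per database; datname is NULL for background workers
SELECT datname, count(*)
FROM pg_stat_activity
GROUP BY datname
ORDER BY count(*) DESC;
```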
You can also get similar result with executed queries:
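A sketch of such a query; the exact columns to select are up to you:

```sql
-- List each connection with the query it is currently (or was last) executing
SELECT datname, pid, usename, state, query
FROM pg_stat_activity
WHERE state IS NOT NULL;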
The output lists each connection together with the query it is currently executing.
This gives you the queries running at a given point in time. However, a query that is merely running is not necessarily a problem; we are more interested in long-running queries, which we can find with the following query:
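A common pattern is to compare `query_start` with `now()`; the 5-second threshold below is an arbitrary choice of mine — adjust it to your workload:

```sql
-- Find non-idle queries running longer than 5 seconds
SELECT pid,
       now() - query_start AS duration,
       state,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;
```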
This clearly shows which queries have been running for a long time. Any query running for many seconds (or minutes, or hours) is worth investigating further.
Connections waiting for a lock
You can see connections that are waiting:
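One way to find them is to filter `pg_stat_activity` on the `Lock` wait event type; `pg_blocking_pids()` (available since PostgreSQL 9.6) additionally shows which backends hold the blocking locks:

```sql
-- Connections currently waiting on a lock, and who blocks them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       now() - query_start AS waiting_for,
       query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';
```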
The result shows which connections are waiting, which lets you figure out which queries are slowed down by locks held by other transactions.
Now we can focus on the queries themselves to figure out which ones are slow.
Queries that run often
Let’s see how many times a given query has been executed:
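A sketch using the `pg_stat_statements` view (the `LIMIT 10` is my choice):

```sql
-- Top 10 queries by number of executions
SELECT calls, query
FROM pg_stat_statements
ORDER BY calls DESC
LIMIT 10;
```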
The output lists each normalized query together with its execution count. The query uses statistics from the pg_stat_statements extension, which must be installed and enabled first.
Metis can show you queries with statistics easily:
You can see long transactions with the following:
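Transaction age can be derived from `xact_start` in `pg_stat_activity`; the ordering and selected columns below are my choice:

```sql
-- Open transactions, oldest first
SELECT pid,
       now() - xact_start AS transaction_duration,
       state,
       query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY transaction_duration DESC;
```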
This shows you which transactions run for a long time. They are likely contributors to high CPU usage because of the amount of work they perform.
Metis can show you the runtime of each query you execute:
Several issues may lead to an ongoing degradation of the database. Let’s look at some of them.
Outdated statistics can cause high CPU usage as well. The query planner may generate inefficient query plans that take much longer to execute. You can find tables with outdated statistics using this query (change the schema name to match your setup):
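One way to spot such tables is `pg_stat_user_tables`: `last_analyze`/`last_autoanalyze` show when statistics were last refreshed, and `n_mod_since_analyze` counts rows modified since then (the `public` schema filter below is a placeholder):

```sql
-- Tables with many modifications since their last statistics refresh
SELECT schemaname,
       relname,
       last_analyze,
       last_autoanalyze,
       n_mod_since_analyze
FROM pg_stat_user_tables
WHERE schemaname = 'public'  -- change the schema name for your needs
ORDER BY n_mod_since_analyze DESC;
```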
You can update your statistics manually with this command:
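For example (`your_table_name` is a placeholder; running plain `ANALYZE;` refreshes statistics for the whole database):

```sql
-- Refresh planner statistics for one table
ANALYZE your_table_name;
```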
Deleted rows may stay on disk as dead tuples to be removed later on. PostgreSQL removes them during the vacuuming process. You can see how many rows are dead with the following query (change the schema name):
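The dead-tuple counters live in `pg_stat_user_tables` as well (the schema filter is again a placeholder):

```sql
-- Tables with the most dead rows, plus when they were last vacuumed
SELECT schemaname,
       relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum
FROM pg_stat_user_tables
WHERE schemaname = 'public'  -- change your schema name
ORDER BY n_dead_tup DESC;
```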
The output shows the number of dead rows per table.
You can vacuum your table with the following command:
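For example (`your_table_name` is a placeholder; `VERBOSE` is optional and prints a progress report):

```sql
-- Reclaim dead rows in one table
VACUUM (VERBOSE) your_table_name;
```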
Metis can show you the number of dead rows easily:
Unused indexes may lead to a higher number of operations: on every write, the database needs to update the table and also every index on it.
You can see how indexes are used with the following query:
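One way is to sort `pg_stat_user_indexes` by scan count; indexes with `idx_scan = 0` are candidates for removal:

```sql
-- Indexes ordered by how rarely they are scanned
SELECT schemaname,
       relname,
       indexrelname,
       idx_scan,
       idx_tup_read,
       idx_tup_fetch
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;
```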
Metis can show you unused indexes easily:
Before dropping an index, make sure it is really unused. It may be unused on your primary database but used on a replica, so double-check all your databases.
You may need to clear the history of index usage to get meaningful results, in case an index was used in the past but isn’t anymore. You can do that with the following query:
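The built-in `pg_stat_reset()` function does this; note that it resets all statistics counters for the current database (not just index usage), which also affects the table and vacuum statistics shown earlier:

```sql
-- Reset all statistics counters for the current database
SELECT pg_stat_reset();
```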
We have seen the reasons for high CPU usage and how to investigate them. We can also rely on automated tools like Metis to identify these issues for us. In the next parts of this series we are going to see how Metis helped with issues in real-life scenarios.