
Troubleshooting PostgreSQL High CPU Usage - Part 2

In the previous part of this series we covered what can cause high CPU usage. In this post we go through troubleshooting scenarios and queries that help identify and fix the issue.
Published on May 24, 2023
By Adam Furmanek (Dev Rel) and the Metis Team

How to troubleshoot - overview

Keep in mind that high CPU usage can be caused by many unrelated things: an expensive operation, memory issues, false sharing, and more. We saw multiple causes in the previous part of this series.

We need to check multiple things and verify our assumptions at each stage. It is worth thinking about recent deployments, weekly patterns, differences in running configuration, locale-specific details, and background processes (CRON jobs, daemons, autovacuum).

In the rest of this post we go through the things worth checking.

Active and idle connections

The first thing to check is how many connections we have in the database.

Number of connections

We can use the following query:

WITH connections AS (
	SELECT
		COUNT(*) AS total_connections,
		SUM(CASE WHEN state != 'idle' THEN 1 ELSE 0 END) AS non_idle_connections
	FROM pg_stat_activity
),
max_connections AS (
	SELECT setting AS max_connections
	FROM pg_settings
	WHERE name = 'max_connections'
)
SELECT
	C.total_connections,
	C.non_idle_connections,
	MC.max_connections,
	ROUND((100 * C.total_connections::NUMERIC / MC.max_connections::NUMERIC), 2) AS connections_utilization_percentage
FROM connections AS C
CROSS JOIN max_connections AS MC
total_connections | non_idle_connections | max_connections | connections_utilization_percentage
8                 | 1                    | 100             | 8.0

We can see the total number of connections and how many of them are not idle. You can also check the maximum number of allowed connections with the following query:

show max_connections;

This returns the connection limit, like below:

max_connections
100

Typically, the default is one hundred connections. You can change the limit to match your needs, but that requires restarting your PostgreSQL instance. The more connections you allow, the higher the memory usage, so try to keep the limit just above what you actually need.
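For example, you can raise the limit with ALTER SYSTEM (the value 200 below is illustrative); the change takes effect only after a server restart:

-- Raise the connection limit; requires a restart to take effect
ALTER SYSTEM SET max_connections = 200;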

Connections per database

The next thing to check is which databases the connections use. You can use the following query:

SELECT
	datname AS database_name,
	COUNT(*) AS non_idle_connections
FROM pg_stat_activity
WHERE state != 'idle'
GROUP BY datname
ORDER BY 2

Example output:

database_name | non_idle_connections
demo          | 1

You can also get a similar result broken down by the executed queries:

SELECT
	datname AS database_name,
	query,
	COUNT(*) AS non_idle_connections
FROM pg_stat_activity
WHERE state != 'idle'
GROUP BY datname, query
ORDER BY 3 DESC

The query gives the following output:

database_name | query                       | non_idle_connections
demo          | SELECT * FROM demo.bookings | 1

This gives you the queries running at a given point in time. However, a query that is merely running is not necessarily a problem. We probably want to find long-running queries, which we can do with the following query:

SELECT
	datname AS database_name,
	query,
	now() - query_start AS runtime
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY 3 DESC

Example output:

database_name | query                       | runtime
demo          | SELECT * FROM demo.bookings | 00:00:10.145824

This clearly shows which queries have been running for a long time. A query that has been running for many seconds (or minutes, or hours) is worth investigating further.
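If you decide a runaway query has to be stopped, you can use PostgreSQL's built-in functions with the pid from pg_stat_activity (the pid below is illustrative):

-- Cancel the currently running query of the given backend
SELECT pg_cancel_backend(27621);

-- If cancelling does not help, terminate the whole connection
SELECT pg_terminate_backend(27621);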

Connections waiting for a lock

You can see connections that are waiting for a lock:

SELECT A.pid, A.query
FROM pg_stat_activity AS A
JOIN pg_locks AS L ON L.pid = A.pid
WHERE L.granted = false
GROUP BY 1, 2

This gives the following result:

pid   | query
27621 | SELECT * FROM demo.bookings

By verifying this you can figure out which queries are slowed down because they wait for locks held by other transactions.
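To also see who is holding the blocking locks, you can use the pg_blocking_pids() function (available since PostgreSQL 9.6); a minimal sketch:

-- Show each waiting backend next to the backend(s) blocking it
SELECT
	blocked.pid AS blocked_pid,
	blocked.query AS blocked_query,
	blocking.pid AS blocking_pid,
	blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
	ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))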

Queries

Now we can focus on queries to figure out which ones are slow.

Queries that run often

Let’s see how many times a given query has been executed:

SELECT SUBSTR(query, 1, 100) AS query, calls, total_exec_time, total_exec_time / calls AS call_time
FROM pg_stat_statements
ORDER BY 4 DESC

This gives the following sample output:

query | calls | total_exec_time | call_time
EXPLAIN (ANALYZE, BUFFERS, TIMING, VERBOSE, COSTS, FORMAT JSON) SELECT "tconst", "titlet | 2 | 3970.404005 | 1985.2020025
SELECT "tconst", "titletype", "primarytitle", "originaltitle", "isadult", "startyear", "endyear", "r | 2 | 3950.9494290000002 | 1975.4747145

The query uses statistics from the pg_stat_statements extension.
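If the extension is not available yet, it has to be preloaded and created first (preloading requires a server restart):

-- In postgresql.conf (requires a restart):
-- shared_preload_libraries = 'pg_stat_statements'

-- Then, in the database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;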

Metis can show you queries with statistics easily:

(Screenshot: SQL query statistics in Metis)

Long transactions

You can see long transactions with the following:

SELECT now() - xact_start AS transaction_time, query
FROM pg_stat_activity
WHERE state IN ('idle in transaction','active')
ORDER BY 1 DESC

Output:

transaction_time | query
00:00:07         | SELECT * FROM demo.bookings

This shows which transactions run for a long time. They are likely contributing to high CPU usage because of the amount of work they perform.
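To stop sessions from sitting idle in transaction indefinitely (they hold locks and block vacuum), you can set a timeout; the 5-minute value below is illustrative:

-- Terminate sessions idle inside a transaction for more than 5 minutes
ALTER SYSTEM SET idle_in_transaction_session_timeout = '5min';
SELECT pg_reload_conf();  -- a reload is enough for this parameter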

Metis can show you the runtime of each query you execute:

(Screenshot: query runtimes in Metis)

Database hygiene

Several things can lead to an ongoing degradation of the database. Let's look at some of them.

Statistics

Outdated statistics can cause high CPU usage as well. The query planner may generate inefficient query plans that lead to much longer execution. You can find tables with outdated statistics using this query (adjust the schema name to your needs):

SELECT
	schemaname,
	relname,
	DATE_TRUNC('minute', last_analyze) last_analyze,
	DATE_TRUNC('minute', last_autoanalyze) last_autoanalyze
FROM pg_stat_all_tables
WHERE schemaname = 'bookings'
ORDER BY last_analyze DESC NULLS FIRST, last_autoanalyze DESC NULLS FIRST;

Output:

schemaname | relname        | last_analyze | last_autoanalyze
bookings   | aircrafts_data | (null)       | (null)
bookings   | ticket_flights | (null)       | 2023-04-06 11:13:00
bookings   | tickets        | (null)       | 2023-04-06 11:12:00

You can update your statistics manually with this query:

ANALYZE TABLE_NAME
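Instead of analyzing manually every time, you can make autoanalyze kick in more often for a frequently changing table by lowering its scale factor (the table name and the 2% threshold below are illustrative):

-- Autoanalyze this table after ~2% of its rows change (the default is 10%)
ALTER TABLE bookings.ticket_flights SET (autovacuum_analyze_scale_factor = 0.02);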

Dead rows

Deleted and updated rows may stay on disk as dead rows to be removed later. PostgreSQL removes them during the vacuuming process. You can see how many rows are dead with the following query (adjust the schema name to your needs):

SELECT
	schemaname,
	relname,  
	n_dead_tup,
	DATE_TRUNC('minute', last_vacuum) last_vacuum,
	DATE_TRUNC('minute', last_autovacuum) last_autovacuum
FROM pg_stat_all_tables
WHERE schemaname = 'bookings'
ORDER BY n_dead_tup desc;

This gives the following output:

schemaname | relname         | n_dead_tup | last_vacuum | last_autovacuum
bookings   | boarding_passes | 0          | (null)      | 2023-04-06 11:11:00
bookings   | tickets         | 0          | (null)      | 2023-04-06 11:11:00
bookings   | airports_data   | 0          | (null)      | (null)

You can vacuum your table with the following query:

VACUUM TABLE_NAME
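You can also vacuum and refresh the statistics in one pass, with details printed along the way (the table name is illustrative):

-- Vacuum the table and update its planner statistics in one go
VACUUM (VERBOSE, ANALYZE) bookings.tickets;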

Metis can show you the number of dead rows easily:

(Screenshot: Metis observability tool showing the number of dead rows)

Indexes

Unused indexes lead to a higher number of operations: on every write, the database needs to update not only the table but also all of its indexes.

You can see how indexes are used with the following query:

SELECT
	idstat.relname AS table_name,
	indexrelname AS index_name,
	idstat.idx_scan AS index_scans_count,
	pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
	tabstat.idx_scan AS table_reads_index_count,
	tabstat.seq_scan AS table_reads_seq_count,
	tabstat.seq_scan + tabstat.idx_scan AS table_reads_count,
	n_tup_upd + n_tup_ins + n_tup_del AS table_writes_count,
	pg_size_pretty(pg_relation_size(idstat.relid)) AS table_size
FROM pg_stat_user_indexes AS idstat
JOIN pg_indexes ON idstat.indexrelname = pg_indexes.indexname AND idstat.schemaname = pg_indexes.schemaname
JOIN pg_stat_user_tables AS tabstat ON idstat.relid = tabstat.relid
WHERE indexdef !~* 'unique'
ORDER BY idstat.idx_scan DESC, pg_relation_size(indexrelid) DESC

Output:

table_name            | index_name        | index_scans_count | index_size | table_reads_index_count | table_reads_seq_count | table_reads_count | table_writes_count | table_size
title_ratings_indexed | title_ratings_idx | 0                 | 28 MB      | 0                       | 2                     | 2                 | 1297467            | 74 MB

Metis can show you unused indexes easily:

(Screenshot: Metis observability tool showing unused indexes)

Before dropping an index, make sure it is really unused. It may be unused on your primary database but still used on a replica. Double-check all your databases.
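Once you have confirmed an index is unused everywhere, you can drop it without blocking writes to the table (the index name comes from the sample output above):

-- CONCURRENTLY avoids locking writes; it cannot run inside a transaction block
DROP INDEX CONCURRENTLY title_ratings_idx;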

You may need to reset the usage statistics to get meaningful results if an index was used in the past but isn't anymore. You can do that with the following query:

SELECT pg_stat_reset()

Keep in mind that this resets all statistics counters for the current database, not just the index usage ones.

Summary

We have seen the reasons for high CPU usage and how to investigate them. We can always rely on automated tools like Metis to identify the issues for us. In the next parts we are going to see how Metis helped with issues in real-life scenarios.

