Hold your horses Postgres - how to debug high CPU usage Part 2

We need to keep in mind that the high CPU usage can be caused by various unrelated reasons. That could be just an expensive operation, or some memory issues, or false sharing, or whatever else. We saw multiple reasons in the previous part of this series.

We need to check multiple things and verify our assumptions at each stage. It’s worth thinking about recent deployments, weekly patterns, differences in running configuration, locale-specific details, running backend processes (CRON jobs, daemons, autovacuum), and other stuff.

Active and idle connections

Number of connections

WITH connections AS (
	SELECT
		COUNT(*) AS total_connections,
		SUM(CASE WHEN state != 'idle' THEN 1 ELSE 0 END) AS non_idle_connections
	FROM pg_stat_activity
),
max_connections AS (
	SELECT setting AS max_connections
	FROM pg_settings
	WHERE name = 'max_connections'
)
SELECT
	C.total_connections,
	C.non_idle_connections,
	MC.max_connections,
	ROUND((100 * C.total_connections::NUMERIC / MC.max_connections::NUMERIC), 2) AS connections_utilization_percentage
FROM connections AS C
CROSS JOIN max_connections AS MC

total_connections	non_idle_connections	max_connections	connections_utilization_percentage
8	1	100	8.0

We can see the number of connections and how many of them are idle. You can also check number of maximum available connections with the following query:

show max_connections;

max_connections
100

Typically, the default is one hundred connections. You can always configure it if needed to match your needs, however, that will require rebooting your PostgreSQL instance. The more connections you have, the higher the memory usage is, so try to tune it to just keep it right above what you need

Connections per database

Next thing to check is which databases are used by the connections. You can use the following query:

SELECT
	datname AS database_name,
	COUNT(*) AS non_idle_connections
FROM pg_stat_activity
WHERE state != 'idle'
GROUP BY datname
ORDER BY 2

database_name	non_idle_connections
demo	1

SELECT
	datname AS database_name,
	query,
	COUNT(*) AS non_idle_connections
FROM pg_stat_activity
WHERE state != 'idle'
GROUP BY datname, query
ORDER BY 3 DESC

database_name	query	non_idle_connections
demo	SELECT * FROM demo.bookings	1

This gives you running queries at a given point in time. However, just running the query may not be a problem. We probably want to find long queries, so do that with the following query:

SELECT
	datname AS database_name,
	query,
	now() - query_start AS runtime
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY 3 DESC

database_name	query	runtime
demo	SELECT * FROM demo.bookings	00:00:10.145824

This shows you clearly which queries are running for a long time. Probably the one that runs for many seconds (minutes/hours) is worth investigating further.

Connections waiting for a lock

SELECT A.pid, A.query
FROM pg_stat_activity AS A
JOIN pg_locks AS L on L.pid = A.pid
WHERE L.granted = false
GROUP BY 1, 2

pid	query
27621	SELECT * FROM demo.bookings

By verifying this you can figure out which queries are slowed down because of other transactions.

Queries

Queries that run often

SELECT SUBSTR(query, 1, 100) AS query, calls, total_exec_time, total_exec_time / calls AS call_time
FROM pg_stat_statements
ORDER BY 4 DESC

query	calls	total_exec_time	call_time
EXPLAIN (ANALYZE, BUFFERS, TIMING, VERBOSE, COSTS, FORMAT JSON) SELECT "tconst", "titlet	2	3970.404005	1985.2020025
SELECT "tconst", "titletype", "primarytitle", "originaltitle", "isadult", "startyear", "endyear", "r	2	3950.9494290000002

Long transactions

SELECT now() - xact_start AS transaction_time, query
FROM pg_stat_activity
WHERE state IN ('idle in transaction','active')
ORDER BY 1 DESC

transaction_time	query
00:00:07	SELECT * FROM demo.bookings

This can show you which transactions are long. They are most likely contributing to high CPU usage due to a lot of work they need to do.

Database hygiene

Things may lead to an ongoing degradation of the database. Let’s see some of them.

Statistics

Outdated statistics can cause high CPU usage as well. The query planner may generate non-efficient query plans that lead to much longer execution. You can find tables that have outdated statistics with this query (change the schema name for your needs):

SELECT
	schemaname,
	relname,
	DATE_TRUNC('minute', last_analyze) last_analyze,
	DATE_TRUNC('minute', last_autoanalyze) last_autoanalyze
FROM pg_stat_all_tables
WHERE schemaname = 'bookings'
ORDER BY last_analyze DESC NULLS FIRST, last_autoanalyze DESC NULLS FIRST;

schemaname	relname	last_analyze	last_autoanalyze
bookings	aircrafts_data	(null)	(null)
bookings	ticket_flights	(null)	2023-04-06 11:13:00
bookings	tickets	(null)	2023-04-06 11:12:00

ANALYZE TABLE_NAME

Dead rows

Deleted entities may stay on the drive as dead rows to be removed later on. PostgreSQL removes these entities during the vacuuming process. You can see how many rows are dead with the following query (change your schema name):

SELECT
	schemaname,
	relname,  
	n_dead_tup,
	DATE_TRUNC('minute', last_vacuum) last_vacuum,
	DATE_TRUNC('minute', last_autovacuum) last_autovacuum
FROM pg_stat_all_tables
WHERE schemaname = 'bookings'
ORDER BY n_dead_tup desc;

schemaname	relname	last_vacuum	last_autovacuum
bookings	boarding_passes	(null)	2023-04-06 11:11:00
bookings	tickets	(null)	2023-04-06 11:11:00
bookings	airports_data	(null)	(null)

VACUUM TABLE_NAME

Indexes

Unused indexes may lead to a higher number of operations. Database needs to update the table and also to update all the indexes.

SELECT
	idstat.relname AS table_name,
	indexrelname AS index_name,
	idstat.idx_scan AS index_scans_count,
	pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
	tabstat.idx_scan AS table_reads_index_count,
	tabstat.seq_scan AS table_reads_seq_count,
	tabstat.seq_scan + tabstat.idx_scan AS table_reads_count,
	n_tup_upd + n_tup_ins + n_tup_del AS table_writes_count,
	pg_size_pretty(pg_relation_size(idstat.relid)) AS table_size
FROM pg_stat_user_indexes AS idstat
JOIN pg_indexes ON idstat.indexrelname = pg_indexes.indexname AND idstat.schemaname = pg_indexes.schemaname
JOIN pg_stat_user_tables AS tabstat ON idstat.relid = tabstat.relid
WHERE indexdef !~* 'unique'
ORDER BY idstat.idx_scan DESC, pg_relation_size(indexrelid) DESC

table_name	index_name	index_scans_count	index_size	table_reads_index_count	table_reqds_seq_count	table_reads_count	table_writes_count	table_size
title_ratings_indexed	title_ratings_idx	0	28 MB	0	2	2	1297467	74 MB

Make sure that your index is not used. It is possible that it’s not used in your primary database, but is used in the replica. Double check all your databases.

You may need to clear history of index usage to get meaningful results in case an index was used before but isn’t anymore. You can do that with the following query:

SELECT pg_stat_reset()

Summary

We have seen reasons for high CPU usage and how to investigate. We can always rely on automated tools like Metis to identify all the issues for us. In the next parts we are going to see how Metis actually helped with issues in real life scenarios.

Troubleshooting PostgreSQL High CPU Usage - Part 2

How to troubleshoot - overview

Active and idle connections

Number of connections

Connections per database

Connections waiting for a lock

Queries

Queries that run often

Long transactions

Database hygiene

Statistics

Dead rows

Indexes

Summary

Keep Your Databases Fast

Harness Your Database Complexity

Prevent Your Data Workflows From Breaking

Never worry about your
database again!

Keep Your Databases Fast

Harness Your Database Complexity

Prevent Your Data Workflows From Breaking

Never worry about your database again!

Contact Us

Never worry about your
database again!