
PostgreSQL join types & strategies for database normalization

One of the good practices for database design is normalization. We decompose data into multiple tables to avoid duplication and keep the storage organized. As a consequence, we need to join tables when querying the data. The SQL engine has to calculate the result of each join, and it can choose between multiple join strategies (algorithms). In this blog post we'll walk through the typical join algorithms and when they're used.
Published on
March 8, 2023
Contributors
Metis Team
Adam Furmanek
Dev Rel

Overview

We’re going to use the demodb database available at https://postgrespro.com/community/demodb. We at Metis provide a docker container with this database: https://github.com/metis-data/postgresql-demo-database 

Specifically, we'll use just two tables: ticket_flights with ~8 million rows and flights with ~200 thousand rows. The schema is documented at https://postgrespro.com/docs/postgrespro/10/apjs02.html:

Bookings Schema Diagram

Table flights has a primary key configured on the flight_id column, which means a B-tree index is built on that column. Similarly, ticket_flights has a primary key configured on the tuple (ticket_no, flight_id).

Before moving on, let's also disable parallel query execution with the following statement:

SET max_parallel_workers_per_gather = 0;

Parallel execution doesn't change the join algorithms themselves, so we can ignore it in the scope of this article.

Nested Loop Join

The first and simplest join strategy is called Nested Loop Join. It can be depicted with the following pseudocode:

For row1 in table1:
	For row2 in table2:
		If (row1 == row2):
		Add_to_result(row1, row2)

We iterate over both tables with two loops, and join them naively. This has quadratic time complexity O(size(table1) * size(table2)). The memory complexity is O(1).
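To make the pseudocode concrete, here is a minimal Python sketch of a nested loop join. The tables are modeled as lists of dicts and the toy data only loosely resembles the demo schema; `nested_loop_join` is an illustrative name, not a PostgreSQL internal.

```python
def nested_loop_join(table1, table2, condition):
    # O(len(table1) * len(table2)) time, O(1) extra memory:
    # compare every row of table1 with every row of table2.
    result = []
    for row1 in table1:
        for row2 in table2:
            if condition(row1, row2):
                result.append((row1, row2))
    return result

# Toy data loosely modeled after the demo schema.
flights = [{"flight_id": 1}, {"flight_id": 2}, {"flight_id": 3}]
tickets = [{"flight_id": 2}, {"flight_id": 2}, {"flight_id": 3}]

pairs = nested_loop_join(tickets, flights,
                         lambda t, f: t["flight_id"] == f["flight_id"])
print(len(pairs))  # 3
```

Because the condition is an arbitrary function, this strategy works for any join predicate, including the inequality joins used below.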

Let’s now see that in action. Take this query:

EXPLAIN
SELECT *
FROM ticket_flights AS tf
JOIN flights AS f ON f.flight_id < tf.flight_id

And here is the plan we obtained:

Nested Loop  (cost=0.42..16532324955.82 rows=601044021228 width=95)
  ->  Seq Scan on ticket_flights tf  (cost=0.00..153851.52 rows=8391852 width=32)
  ->  Index Scan using flights_pkey on flights f  (cost=0.42..1253.81 rows=71622 width=63)
        Index Cond: (flight_id < tf.flight_id)

We can see that the engine decided to use an index to scan the flights table. No index was used to scan ticket_flights, which is unfortunate: a sequential scan reads every row of the table, which amounts to a lot of data. When filtering, we generally want to avoid scanning the whole table and read as little data as possible. Finally, the two scans are joined with the Nested Loop algorithm.

 

Let’s see if changing the order of joins matters. Take this query:

EXPLAIN
SELECT *
FROM flights AS f
JOIN ticket_flights AS tf ON f.flight_id < tf.flight_id

And here is the plan we get:

Nested Loop  (cost=0.42..16532324955.82 rows=601044021228 width=95)
  ->  Seq Scan on ticket_flights tf  (cost=0.00..153851.52 rows=8391852 width=32)
  ->  Index Scan using flights_pkey on flights f  (cost=0.42..1253.81 rows=71622 width=63)
        Index Cond: (flight_id < tf.flight_id)
We can see that the results are the same.

 

However, changing the output aggregation can change the way the tables are scanned, even though it doesn't change the join algorithm. Let's take this query:

EXPLAIN
SELECT COUNT(*)
FROM ticket_flights AS tf
JOIN flights AS f ON f.flight_id < tf.flight_id

We get the following plan:

 

Aggregate  (cost=18034924512.89..18034924512.90 rows=1 width=8)
  ->  Nested Loop  (cost=0.42..16532314459.82 rows=601044021228 width=0)
        ->  Seq Scan on ticket_flights tf  (cost=0.00..153851.52 rows=8391852 width=4)
        ->  Index Only Scan using flights_pkey on flights f  (cost=0.42..1253.81 rows=71622 width=4)
              Index Cond: (flight_id > tf.flight_id)
We can see that the flights table is now scanned with an Index Only Scan. This operation doesn't need to touch the table rows at all: it gets everything from the index, which makes it even faster than a regular Index Scan. The scans are once again joined with the Nested Loop operation, and finally the Aggregate operation computes the row count.

Hash Join

 

The next strategy is called Hash Join, and it consists of two phases. In the first phase we build a hash table from one of the tables being joined. In the second phase we iterate over the rows of the other table and look up each match in the hash table. The algorithm looks like this:

For row1 in table1:
	hashtable.add(row1.id, row1)

For row2 in table2:
	row1 = hashtable.get(row2.id)
	If row1 is not Null:
		Add_to_result(row1, row2)

The time complexity is O(size(table1) + size(table2)), assuming the hash function is good and lookups take O(1). The memory complexity is O(size(table1)), so the order matters: the engine generally prefers to hash the smaller table.
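Here is the same idea as a Python sketch, under the simplifying assumption that the whole build side fits in memory (a real engine partitions and spills to disk when it doesn't):

```python
def hash_join(build_table, probe_table, key):
    # Build phase: hash the (preferably smaller) build side.
    # Uses O(len(build_table)) memory.
    buckets = {}
    for row1 in build_table:
        buckets.setdefault(key(row1), []).append(row1)
    # Probe phase: one expected O(1) lookup per row of the other table.
    result = []
    for row2 in probe_table:
        for row1 in buckets.get(key(row2), []):
            result.append((row1, row2))
    return result

flights = [{"flight_id": 1}, {"flight_id": 2}, {"flight_id": 3}]
tickets = [{"flight_id": 2}, {"flight_id": 2}, {"flight_id": 3}]

pairs = hash_join(flights, tickets, key=lambda r: r["flight_id"])
print(len(pairs))  # 3
```

Note that hashing only works for equality conditions; that's why the queries below join on `=` rather than `<`.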

 

Let’s see that in action:

EXPLAIN
SELECT *
FROM ticket_flights AS tf
JOIN flights AS f ON f.flight_id = tf.flight_id

This is the plan:

 

Hash Join  (cost=9767.51..302691.07 rows=8391852 width=95)
  Hash Cond: (tf.flight_id = f.flight_id)
  ->  Seq Scan on ticket_flights tf  (cost=0.00..153851.52 rows=8391852 width=32)
  ->  Hash  (cost=4772.67..4772.67 rows=214867 width=63)
        ->  Seq Scan on flights f  (cost=0.00..4772.67 rows=214867 width=63)

We can see that the engine decided to scan the flights table and then build a hash table out of it. Next, it iterates over the ticket_flights table and matches the rows based on the condition.

If we swap the join order in the SQL query like this:

EXPLAIN
SELECT *
FROM flights AS f
JOIN ticket_flights AS tf ON f.flight_id = tf.flight_id

then we get exactly the same plan:

 

Hash Join  (cost=9767.51..302691.07 rows=8391852 width=95)
  Hash Cond: (tf.flight_id = f.flight_id)
  ->  Seq Scan on ticket_flights tf  (cost=0.00..153851.52 rows=8391852 width=32)
  ->  Hash  (cost=4772.67..4772.67 rows=214867 width=63)
        ->  Seq Scan on flights f  (cost=0.00..4772.67 rows=214867 width=63)

The engine is allowed to do so. SQL queries are declarative, so they define what the result is, but they don’t dictate how the result is calculated.

 

However, if we add the aggregation

EXPLAIN
SELECT COUNT(*)
FROM ticket_flights AS tf
JOIN flights AS f ON f.flight_id = tf.flight_id

Then we get this plan:

Aggregate  (cost=271560.70..271560.71 rows=1 width=8)
  ->  Hash Join  (cost=8298.51..250581.07 rows=8391852 width=0)
        Hash Cond: (tf.flight_id = f.flight_id)
        ->  Seq Scan on ticket_flights tf  (cost=0.00..153851.52 rows=8391852 width=4)
        ->  Hash  (cost=4772.67..4772.67 rows=214867 width=4)
              ->  Seq Scan on flights f  (cost=0.00..4772.67 rows=214867 width=4)

You can see that the flights table is still scanned sequentially as before; there is no index scan this time.

Merge Join

The Merge Join algorithm is used when the rows of both inputs can be iterated in sorted order. It works like this:

table1_sorted = table1.sort()
table2_sorted = table2.sort()
row1 = table1_sorted.first()
row2 = table2_sorted.first()

While row1 is not Null and row2 is not Null:
	While row2 is not Null and row1 >= row2:
		If row1 == row2:
			Add_to_result(row1, row2)
		row2 = row2.next()
	row1 = row1.next()

The time complexity is O(size(table1)*log(size(table1)) + size(table2)*log(size(table2)) + size(table1) + size(table2)). The memory complexity is O(size(table1) + size(table2)).

However, if the data is already ordered, then we get O(size(table1) + size(table2)) for time complexity and O(1) for memory complexity.
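As a sketch, the merge phase can be written in Python with two cursors. For brevity this version assumes the keys in table1 are unique (as with a primary key), which sidesteps the backtracking a general merge join needs for duplicates on both sides:

```python
def merge_join(table1, table2, key):
    # Sort both inputs (the engine skips this step when an index
    # already provides the order), then merge with two cursors.
    t1 = sorted(table1, key=key)
    t2 = sorted(table2, key=key)
    i = j = 0
    result = []
    while i < len(t1) and j < len(t2):
        k1, k2 = key(t1[i]), key(t2[j])
        if k1 == k2:
            result.append((t1[i], t2[j]))
            j += 1  # t1 keys are unique, so only t2 may have duplicates
        elif k1 < k2:
            i += 1
        else:
            j += 1
    return result

flights = [{"flight_id": 3}, {"flight_id": 1}, {"flight_id": 2}]
tickets = [{"flight_id": 2}, {"flight_id": 3}, {"flight_id": 2}]

pairs = merge_join(flights, tickets, key=lambda r: r["flight_id"])
print(len(pairs))  # 3
```

Each cursor only moves forward, so the merge phase itself is a single linear pass over both inputs.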

Let’s see that in action. First, disable the hash join strategy:

SET enable_hashjoin = off;

 And then run this query:

EXPLAIN
SELECT *
FROM ticket_flights AS tf
JOIN flights AS f ON f.flight_id = tf.flight_id

We get the following plan:

 

Merge Join  (cost=1520511.52..1676140.76 rows=8391852 width=95)
  Merge Cond: (f.flight_id = tf.flight_id)
  ->  Index Scan using flights_pkey on flights f  (cost=0.42..8245.57 rows=214867 width=63)
  ->  Materialize  (cost=1520506.91..1562466.17 rows=8391852 width=32)
        ->  Sort  (cost=1520506.91..1541486.54 rows=8391852 width=32)
              Sort Key: tf.flight_id
              ->  Seq Scan on ticket_flights tf  (cost=0.00..153851.52 rows=8391852 width=32)

We can see that the engine had to sort the ticket_flights table after scanning it. The flights table, however, was already in order, because its primary key index is a B-tree that can be scanned in sorted order.

The ticket_flights table needs to be sorted because its primary key consists of the ticket number and the flight id, in that order. Since the order of fields in a composite key matters, the rows are not necessarily ordered by flight id alone.
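The effect is easy to reproduce in a few lines of Python (the ticket numbers are made up for illustration): rows sorted by the composite key (ticket_no, flight_id) are not sorted by flight_id alone.

```python
# Rows ordered by the composite key (ticket_no, flight_id) ...
rows = [("T1", 5), ("T1", 9), ("T2", 1), ("T2", 7)]
assert rows == sorted(rows)

# ... but the flight_id column on its own is out of order,
# which is why the engine has to sort before a merge join.
flight_ids = [fid for _, fid in rows]
print(flight_ids)                        # [5, 9, 1, 7]
print(flight_ids == sorted(flight_ids))  # False
```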

Summary

The engine can choose how to calculate the join of two tables. The various algorithms have different time and memory complexities, so it's useful to understand how to speed things up: by adding indexes, or by using join conditions that allow the engine to pick a more efficient join algorithm.

