We often need to deal with long and unreadable SQL statements. There are multiple ways to increase their readability and maintainability in a longer run. Let’s start with some examples:
- UPPERCASE for keywords - we should write all keywords (like SELECT, FROM, JOIN) all uppercase to distinguish them from other parts of the query
- snake_case for identifiers and names - name your table customer_data instead of CustomerData (so use lowercase and underscore)
- Aliases - just rename your columns and tables in the query to improve the readability, especially if you join the same table many times
- Indentation - indent column names in SELECT part, indent subqueries, indent WHERE filters
- Use JOIN ON instead of JOIN + WHERE (even these two are equivalent)
There are probably other best practices that we can follow, especially depending on where we work and what software we use. However, all these tricks do not let us avoid repetition and make our queries significantly shorter.
In order to fix that, we can use two different approaches. They will make our code more readable and easier to maintain in the long run.
Common Table Expressions
Common Table Expression (CTE), also known as WITH expression, allows us to write an auxiliary statement to later refer to it in a larger query. It’s like a temporary table that exists just for one query. CTE can contain SELECT, INSERT, UPDATE, or DELETE, however, most often we use it with SELECT statements.
Let’s see an example. Imagine that we store data about orders in a given region.
We would like to show orders from regions that constituted at least 10% of the total sales. Let’s take this query:
We show region, product, and the total number of products sold in that region. We filter regions based on the condition that the sum of orders in a given region must be at least 10% of the sum of all orders.
However, we can see that this query isn’t very straightforward. It mixes where things are calculated and filtered, calculates sum many times, and doesn’t provide a clear understanding of how we calculate the threshold. Let’s now rewrite it with the WITH statement:
We first calculate the threshold as a temporary table with one column. Next, we calculate another temporary table with the sum of sales per region. We then find regions that performed the best. Finally, we just present the data.
We can see the query is much more readable and easier to understand. We can see that it’s not nested multiple times, things are “linear” and we can understand what’s going on just by following the query and meaningful names.
However, CTEs can allow us things that are not otherwise possible in standard SQL. CTEs can be recursive, so the temporary table can refer to itself to fill the content. Let’s see that in action:
We define a recursive CTE that has one column. We set the initial value to 1. Next, we use the UNION ALL to introduce the recursive term that can refer to the query’s own output. So we take all the rows that are below one hundred, and we create new rows with values increased by one. This generates numbers from 1 to 100.
A window function calculates values across a set of table rows related to the current row. This is very similar to aggregation, but window functions do not group rows into a single row. The rows are still independent, however, they can contain values grouped together.
Let’s take the example from the previous section but ignore the top region part:
So we take the region, product, and the total sales for a given region. With window function we can do the following:
We simply specify that the value for total_sales column should take all the rows for which the region column has the same value as the region value of the current row, and then sum all of them.
We can also include the ordering. Let’s say that we wanted to rank the products within a region. We can use the following:
We once again include rows from the same region, but this time we also order them by the decreasing amount, and we calculate the value with rank() function
There are many more functions that we can use:
- FIRST_VALUE - returns value of the first row in the partition
- LAST_VALUE - returns value of the last row in the partition
- LAG - returns value of the row before the row in the partition
- LEAD - returns value of the row after the row in the partition
- NTH_VALUE - returns value of the n-th row in the partition
We can also use the regular aggregate functions like SUM, MIN, MAX, etc.
Let us now see some practical examples.
Choosing n-th element from a group
Let’s say that we would like to group rows and select top three rows in each group:
We use CTE to number orders in a given region based on their rank. Then, we select rows for which rank is less than 4.
Let’s say that we would like to calculate the running sum, so sum the value of a given column for all the rows before the current row:
Difference between adjacent rows
Let’s say that we would like to show by how much the product is performing better than the previous one. We can do that like this:
So we take the current amount and subtract the amount of the product that was one row earlier using the LAG function.
It may seem that CTE and window functions bring benefits only. That’s not necessarily the case. Let’s take these two queries:
They are equivalent. The boarding_passes table has 8 million rows in total. The former query takes 13 seconds to complete, while the latter takes 8. They return the same result, but their performance differs greatly.
Common Table Expressions and window functions let you make your queries much more readable and maintainable. They also make some queries doable directly in the SQL. It’s always a good idea to explore whether they improve the readability in the specific case.