What could be the reason
High CPU usage can be caused by a very expensive calculation. That is the first explanation that comes to mind. However, high CPU usage can also be a manifestation of other issues around memory management, lock contention, or wrong assumptions made by the SQL engine. In this section we're going to look at some general computer science reasons for high CPU usage.
Some operations are expensive in terms of CPU power. While addition and subtraction are pretty cheap, division and modulus are harder, but the real killers are trigonometric operations and exponentiation in general. They can be hundreds of times slower than a plain addition (the baseline).
Another group of expensive operations is cryptographic functions. They are slow by design, to make brute-force attacks impractical. For instance, a good password-hashing function performs many iterations just to spend more computing time.
Depending on your situation, you may be able to get improvements in multiple ways. Some operations come in faster versions; for instance, there can be a fast sine function implemented with SSE or AVX instructions. While these functions provide lower precision, they are much cheaper in terms of CPU cycles.
Another way is to replace functions with simpler equivalents. This can be a better algorithm (like calculating a Fibonacci number with matrix exponentiation instead of a loop, or using a fast exponentiation algorithm), but it can also be trigonometric identities or other equivalent formulas that give the same result. Sometimes it's beneficial not to perform a calculation at all if the result doesn't depend on it. For instance, if you want to find the closest point, you don't need to take the square root of the sum of squared differences. You can compare the sums directly, as the square root doesn't change the ordering.
Yet another way is calculating things once. Caching, lookup tables, and other ways of persisting the results of intermediate operations are often beneficial.
However, you always need to measure your improvements. For instance, caching a result may sound like a good solution, but let's take the following example:
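The original query is not reproduced here; a hypothetical sketch of the kind of query being described (the `tickets` table, the `ticket_no` column, and the use of `MD5` are assumptions) could be:

```sql
-- The hash is computed once and "cached" in a CTE.
WITH hashed AS (
    SELECT ticket_no, MD5(ticket_no) AS ticket_hash
    FROM tickets
)
SELECT ticket_hash, COUNT(*)
FROM hashed
GROUP BY ticket_hash;
```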
Here we calculate the hash of the ticket number and store the temporary result in a CTE. We may think that the query above is equivalent to this one:
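A hypothetical sketch of such an inlined variant (table and column names are assumptions):

```sql
-- Same result, but the hash expression is written out inline
-- instead of being stored in a CTE.
SELECT MD5(ticket_no) AS ticket_hash, COUNT(*)
FROM tickets
GROUP BY MD5(ticket_no);
```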
While they are equivalent in terms of results, they are very different in terms of performance. The former runs in 13 seconds while the latter finishes in 8. The lesson is: always measure your improvements.
Memory is nearly always accessed via pages. A memory page is a block of bytes, typically 4 kB long (the exact size depends on the CPU architecture). A page is addressed by a virtual address, so we almost never touch physical memory directly. The CPU needs to map this virtual address to a physical address that points to a frame. A frame is, again, a block of bytes stored in physical memory.
Because one frame can be used by multiple applications (think of the same binary file executed by multiple processes), one frame can be mapped to multiple virtual addresses. This saves memory (because each application reuses the same frame), but it makes memory management a little slower due to the additional level of indirection in addressing.
The CPU and the operating system need to take care of loading memory frames and flushing them to the drive. This process is typically very fast; however, it cannot be neglected. Sometimes loading and unloading memory takes a significant amount of time, especially when we are short of physical memory (and the system is heavily overloaded).
Another interesting phenomenon is Bélády's anomaly: with some page-replacement algorithms (such as FIFO), increasing the number of frames (the amount of physical memory) can increase the number of page faults (situations when the needed page is not available in physical memory), which results in more page loads. This can decrease performance.
Since paging is transparent to our applications, we don't get any exceptions or errors when it happens. The only thing we can observe is decreased performance, because memory access simply becomes slower. Paging can be especially painful when we need to sort entities or join tables.
Memory paging is also strongly coupled with the order in which we access pages. For instance, one of the common optimization techniques concerns the way we traverse arrays in our programming languages. C-based languages use row-major order, while Fortran-based languages go with column-major. Row-major means that when we have a two-dimensional array, the elements are stored row after row, like this:
With row-major order we should traverse the array in the ij order, with the following code:
With column-major order we should go with the ji order instead. This is especially important in matrix multiplication, where the ikj loop order is much faster than the ijk one.
This is closely related to the paging we discussed in the previous section. However, this time we focus entirely on SQL engines.
In this case the SQL server decides to spill temporary results to the drive. While there is enough memory for the SQL engine (the amount we configured is available and accessible to the engine), it has so many running operations that it needs to store temporary results on the drive. For instance, the engine may need to sort rows, build transient in-memory hash tables, calculate temporary tables, etc. The engine cannot use more memory (either because of configured limits or because of swapping), so it needs to flush results to disk.
We can help the situation by decreasing the load on the SQL engine, by rewriting the queries to use less data, or by calculating temporary tables manually instead of creating them implicitly (via subqueries or CTEs). Again, many optimizations apply to the running configuration of the server, so if we change the parameters, the optimizations may stop working (or even decrease performance).
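A hypothetical sketch of the manual-materialization approach (all table and column names are made up; the index line uses PostgreSQL syntax):

```sql
-- Materialize the intermediate result once, instead of letting the
-- engine spill an implicit CTE/subquery to disk on every execution.
CREATE TEMPORARY TABLE recent_orders AS
    SELECT customer_id, SUM(total) AS total
    FROM orders
    WHERE order_date > CURRENT_DATE - 30
    GROUP BY customer_id;

CREATE INDEX ON recent_orders (customer_id);

SELECT c.name, r.total
FROM customers c
JOIN recent_orders r ON r.customer_id = c.id;
```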
Locks and contention
Locks and latches are used in every transaction. They can be taken explicitly (depending on the isolation level) or implicitly (for instance, to traverse internal structures). If multiple transactions access the same data, we may get lock contention.
However, one of the big issues with concurrent access is so-called false sharing. Because data is cached and accessed concurrently, we may accidentally contend for the same memory even if we don't logically share it. See the following code:
We have an array of 64 integers. We create two threads, and each of them loops and increments an independent integer. It seems there is no locking here and nothing can slow us down. However, the CPU caches data (and keeps it coherent with the MESI protocol), and a cache line is larger than one integer (typically it holds 16 integers or more). Because both threads operate on the same cache line, the line has to be invalidated and reloaded between the cores' L1 caches every time either thread performs a modification.
False sharing can pop up in multiple scenarios. For instance, take the following one:
- First transaction appends entities at the end of the clustered index
- Second transaction reads first entities from the index
Do these two transactions compete for any resource? It seems they do not. However, when you add a new page to the index, you may need to update the root B-tree node. To update the node, you need to lock it. So even though the other transaction doesn't modify the node (it just reads the data), it still needs to wait for the lock to be released. This can decrease performance significantly.
There are many more examples where an application needs to wait for locks to become available. For instance, each memory allocation may need to acquire a heap lock (provided by the operating system). Each dynamic type creation may require taking a lock on internal structures. False sharing and memory contention are so common that some applications implement their own synchronization primitives to improve performance. Some SQL engines even implement their own operating-system layer (like Microsoft's SQLOS).
Statistics in the SQL engine are used to plan query execution. The engine may perform operations in multiple ways; for instance, joining two tables can be done with one of several strategies. The engine decides how to execute an operation based on the number of rows, data distribution, indexes, configuration parameters, and many other things. Most of them are stored as statistical data about the tables.
Statistics are stored and updated automatically. The engine refreshes them when a significant number of entities has been modified. We can force recalculation of the statistics, but we typically don't fiddle with them on our own.
Statistics include the number of rows, row sizes, data distribution (number of null values, etc.), and other helpful aggregates that help the engine decide what to do and how to do it. However, statistics may get out of date, especially when we modify entities often. This may lead to suboptimal operations, for instance accessing tables in a suboptimal order or using a less efficient join strategy.
We need to keep an eye on the statistics and refresh them when they become stale.
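For instance (the `tickets` table name is an assumption):

```sql
-- PostgreSQL: recompute statistics for one table.
ANALYZE tickets;

-- Microsoft SQL Server equivalent:
UPDATE STATISTICS tickets;
```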
Delayed updates and fragmentation
When we modify data in a table, the engine is free to mark rows as tombstones (or dead rows). This means that the rows are still stored on the drive, but they are marked as removed and no longer applicable. This leads to fragmentation.
Fragmentation can be internal or external. Internal fragmentation is the situation when we have some empty space inside a page that we can't use anymore. That space is effectively wasted. However, when scanning the table, we still need to load this memory and skip over it. This takes time.
External fragmentation occurs when the order of the pages on the drive is different from the logical order of the entities. Even when we store the table as a clustered index (which enforces the logical order of rows), the data can still be stored out of order physically. Imagine that you have entities with identifiers from one to fifty and from eighty to one hundred. If you now add rows with identifiers between fifty and eighty, these rows need to be stored between the existing ranges. However, there is no physical space at that position in the index, so the engine may decide to allocate an additional page on the side. When scanning the index, the engine then needs to “jump” between pages to read the data in the correct order.
We can remove fragmentation by reorganizing or rebuilding indexes, by removing dead rows, or with an autovacuum process. There are multiple operations that the database can run behind the scenes to improve the quality of the storage. However, these operations may be CPU intensive and themselves result in high CPU usage.
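Typical commands (against a hypothetical `tickets` table):

```sql
-- PostgreSQL: reclaim dead rows and rebuild the table's indexes.
VACUUM (ANALYZE) tickets;
REINDEX TABLE tickets;

-- Microsoft SQL Server: reorganize (lightweight) or rebuild (full).
ALTER INDEX ALL ON tickets REORGANIZE;
ALTER INDEX ALL ON tickets REBUILD;
```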
We have seen multiple reasons for high CPU usage. In the next part we are going to discuss ways of troubleshooting these issues. We'll see how to investigate the reason for the CPU usage and how to fix it.