Database consists of aggregate data stored in files. No matter if it’s SQL or NoSQL database, the data is ultimately stored on the hard drive. To make processing faster, each engine implements various techniques for memory management, caching, parallelization, or speculative execution.
All these techniques improve the performance by increasing the throughput or reducing the execution time, however, they are based on assumptions about how data is stored or what operations are executed. There are different cases in distributed databases or heterogeneous environments. Each operation may be affected by transaction isolation level and face a performance degradation. Therefore, we need to track database performance metrics to identify issues and fix the performance as early as possible.
Database performance metric is a key indicator describing some aspect of the operation. There are multiple areas which we can track, like memory subsystem, CPU consumption, disk usage, availability or throughput. We need to remember that the overall performance of the system will be affected if any of these metrics are affected negatively. For instance, high memory usage may lead to high CPU consumption that will leave less CPU cycles for processing of other tasks (like clearing memory pages) which will affect other transactions. In short, issues spread quickly.
Key Database Performance Metrics
Let’s now dig deeper into actual performance metrics.
Metrics related to the infrastructure
First metrics we need to track are the basics of the machine that our database is running on. These include available working memory (both paged-in and paged-out), CPU usage (for each core), available disk space, disk fragmentation, IO operation latency (and queue), network traffic, number of processes, number of threads, number of operating system handles, and many more.
These metrics should be provided by our hosting provider. They can be obtained from the operating system counters. No matter what is wrong with our database, these metrics should indicate there is an issue. However, very rarely do we need to fix these metrics per se. If the CPU spikes, then we need to understand what is running on the machine, not increasing the CPU performance directly.
We need to find the right balance between having room for unexpected spikes and good utilization of the system. We should aim for CPU utilization to be around 70% of the time, low disk fragmentation, 70% of the memory being used, etc.
Metrics related to throughput
Next group of metrics are metrics showing “how much we can do”. They include:
- Number of active connections
- Number of active transactions
- Number of active memory pages
- Number of opened files
- Transaction log size
- Number of cache misses
- Number of page misses
- Time for completing the query (average, p95, tm95)
Those metrics show how fast the system is. Again, we want to target 70% for the available resources being used, and we want to minimize the number of cache misses or page misses. For any cache we have, we want to have a cache hit ratio as high as possible.
Regarding the time for completing the query - it’s important to understand that the query must be as fast as needed but not faster. There is no point in reducing the query execution time to nanoseconds in general. Always understand that your performance is a feature, but human perception is also limited.
Metrics related to database performance
Now we focus on metrics specifically for databases. We should look for:
- Lock timeouts
- Number of delayed writes and for how long they were waiting
- How often the data is spilled to disk
- Length of the transaction log
- Data distribution and how much data we store
- How many nulls we have
- Slow queries
- Number of errors and warnings
- Number of scheduled tasks (vacuuming, index updates)
- Backups and other periodic tasks
These metrics should be grouped by the database and replica. They show ongoing processes in the database, and should be viewed periodically to make sure we don’t have any performance degradation. When things slow down, we should be able to correlate them with other metrics (like throughput or infrastructure) to identify scenarios like “many more users came to the platform that lead to longer query times and higher CPU usage”. Building a plausible scenario explaining the metrics is the first step for improving the performance.
Other metrics include indicators around security (number of users, number of account changes, how often there is an authentication error, configuration changes), logging (how many errors or exceptions we have, how many errors per specific error code, etc.).
These metrics can help us pinpointing things that can easily go unnoticed (like bruteforce attack), but should be actioned separately.
Analyzing Database Performance Metrics
Analyzing database performance metrics may not be trivial. We need to follow multiple leads and incorporate metrics from all the available layers to understand what is exactly going on.
First thing to understand is that metrics from various layers are connected. It’s very rare that we will see spikes in CPU usage without spikes in other metrics. High CPU usage will lead to slower query performance, delayed transactions, more locking contention, or issues with scheduled tasks. To improve the database performance, we need to understand all layers of metrics and find the issues that affect the system overall.
Second, we need to cure the source of troubles, not the metric itself. If we see a CPU spike, then we need to work on what caused this spike instead of just upscaling the CPU. Obviously, sometimes we just need to upscale our database, but most of the time we can deal with things around. That’s why we need to build a narrative that explains the metrics anomaly across all layers. For instance, the CPU spiked because there were more tasks running. We had more tasks because each task takes more time to complete. Each task works longer because the amount of data we store in the database doubled over the last three months. With a narrative like this one we can now answer the important question: do we want to hold this much data? If yes, then maybe we need to adjust the schedule of maintenance tasks. If not, then maybe we need to archive some data and make the database smaller again.
Third, we need to remember that metrics are interconnected. High memory usage will often lead to higher CPU usage because the CPU will spend time on dealing with fragmentation or cleaning garbage. Similarly, too small a limit for the transaction log length may lead to more disk activity which will degrade the performance. Whenever we deal with performance indicators, we need to look for the source of the problem. Acting on any breached metric will lead to hiding the issue for a short time, not solving the problem entirely.
Last but not least, we’re still humans. We need to have dashboards and alarms configured for the metrics. However, if we set too many alarms, we will face too much noise and won’t be able to act accordingly. Therefore, it’s crucial to have metrics and charts that show the reason behind the problem, not the raw metrics that we can’t reason about. See the Metis dashboard:
This dashboard doesn’t try to swamp you with details of each metric or alarms for every anomaly it detects. Instead, it connects the dots for you and makes sure you have all important information available instantly. When something goes wrong, you can dig into details and find what’s happening with one of the databases:
This way you can find the problem much easier, without being overloaded with metrics at start.
Finally, metrics must be grouped by logical dimensions. You need metrics per database, per replica, per version, per region, per country, grouped by date and time. This makes it easier to look for weekly patterns, country-specific behavior, or even tune alarms based on the characteristics of some particular region. There is no size that fits all. You need to adjust your solutions to your needs.
Implementing Effective Database Performance Monitoring
To implement the database performance monitoring the right way, we need to use proper tooling and systems that understand the broader context. We then need to tune alarms and settings, and finally get proper analysis and reporting.
Good tools will give you configurable dashboards that you can adjust to your needs. You can create dashboards per system or per region. You also need to integrate the tools with the rest of your ecosystem: paging system for alarms, metrics collector to get statistics from other data sources, your cloud and infrastructure provider, or even messaging platforms for showing details easily during status meetings.
Good tooling must be extensible. Each company is different, and you need to be able to configure the platform to match your needs. Especially in heterogeneous environments that utilize multiple platforms, databases, or operating systems.
Finally, good tooling must analyze data in real time. We can’t let the metrics be delayed or out of date. They must be as accurate as possible and always reflect the current state of the system.
When there is an anomaly detected, the performance monitoring system must alert us. Each alert must be configurable in multiple ways: how early is it triggered or how many data points it needs, whether it alerts on missing metrics, who is paged, how is it escalated, what is the SLA for the fix, and many more settings that each team needs to be aware of.
However, the most important part of alerts are the thresholds. We need to be able to configure which anomalies are acceptable, and which anomalies should be actioned immediately. Finding the right thresholds isn’t trivial and requires a lot of experience and hard work. Machine learning based solutions are also helpful because they can run anomaly detection for us based on the observed patterns.
Last but not least, we need reports about the performance. These reports should be reviewed manually every now and then. The team should take a look at the number of breaches, look for common sources of issues, and find ways to prevent them in the longer run. Some alerts can be ignored because they resolve themselves automatically, however, if they happen too often, then we should either relax the constraints or fix the problems.
Reports need to be generated automatically and include enough historical data to show the big picture. Performance tuning is never finished. We need to look for long-term trends and changes that we can improve in the future.
Good database performance monitoring system covers a wide range of metrics, including metrics around infrastructure, throughput, database performance, and operational metrics. They need to allow for configurable alerts that do not swamp administrators with raw numbers, but they show actual issues with the performance. They need proper tooling that we can configure to our needs, alerts based on automated anomaly detection, and reporting that we can review periodically to find recurring issues.
To improve the database performance we need to build a narrative that explains metrics across all layers. This is a never-ending story that we need to repeat every now and then. Our datasets grow over time, our business gets bigger and bigger. We can’t solve these problems once and for all, but we need to constantly monitor and improve. Good visualization and tooling that deals with databases is a way to go for building approachable monitoring metrics. They also increase our understanding of the system and make sure we run smoothly over time.
What are the key database monitoring metrics to consider?
Infrastructure (memory, CPU, disk, network), throughput (connections, transactions, files, logs, caches, pages), database (deadlocks, locks, timeouts, data distribution, slow queries, errors, exceptions), operational (users, authentications, authorizations, connections).
What are the primary and secondary metrics for performance analysis?
Primary metrics include aggregated numbers per server or database engine. Secondary metrics focus on split based on region, country, date and time, version, language, etc.
How do CPU utilization and resource consumption metrics impact performance?
CPU is rarely a bottleneck. More often it indicates issues with configuration, transactions, data size increase, or other aspects of the running configuration.
What are the important I/O performance metrics to monitor?
Number of writes, I/O queue length and latency, disk size, number of spills to disk, number of network connections, total network utilization.