Troubleshooting Database Issues Like a Pro

Welcome to our comprehensive guide on database troubleshooting, where we will delve into the intricate world of diagnosing and resolving common database issues. In this article, we'll equip us with practical solutions and best practices to optimize database performance, ensuring that our systems run smoothly and efficiently. We'll focus on three crucial aspects: troubleshooting SQL queries, managing CPU usage, and optimizing memory utilization. By the end of this guide, we'll have the tools and insights to not only tackle database issues head-on but also enhance our understanding of how databases function under the hood. Let's embark on a journey to unlock the secrets of database optimization and ensure seamless operations for our applications.

Unmasking Database Performance Hurdles: Tracing the Roots of Bottlenecks

Databases are complicated entities therefore many things can go wrong. When we analyze a performance problem, trying to find the root cause, we should usually start with the question: what has changed?

When a once-smooth-running database suddenly stumbles, the path to recovery begins with understanding all the moving parts and recent changes. It could be a new version of the code introducing inefficiencies, a surge in activity stressing the system, a data influx overwhelming resources, or infrastructure issues like a sluggish network.

Database performance issues often begin with connectivity. Sometimes, our database responds sluggishly, causing query delays. At other times, it becomes completely inaccessible. These problems can result from network issues, hardware failures, or rare file corruption.

If the database is reachable, our initial focus should be on identifying the slowest queries and determining the reasons behind their sluggishness. This involves discerning whether the slowdown is attributed to a single specific query or if all queries are experiencing extended durations compared to their usual performance. To achieve this, we need need to gather two essential pieces of information:

Identify the Slowest Queries: Begin by pinpointing which queries are taking the longest to execute. Isolating these queries is the first step towards resolving the performance issue.
Establish Previous Average Durations: Compare the current execution times of these slow queries to their average durations before the performance problem arose. This context will help us gauge the extent of the slowdown and provide insights into the scope of the issue.

By understanding both the queries causing delays (what) and their historical performance metrics, we'll be equipped to delve deeper into the root causes of the sluggishness (why) and take targeted actions to optimize their execution.

SQL Query Troubleshooting

When addressing query performance issues, it's important to note that the focus typically extends beyond individual queries and delves into query patterns, often referred to as "Normalized queries." This concept highlights that specific queries can have variations in their parameters while adhering to a common pattern.

For instance, consider the query pattern: SELECT * FROM customer_details WHERE customerID = $1 This pattern might be executed numerous times within an hour, each instance with a distinct value assigned to the parameter $1. In practice, this means that while the core structure of the query remains the same, it's the changing parameters that lead to variations in execution.

Every database should incorporate a robust monitoring solution. Cloud vendors typically offer basic built-in monitoring tools, but it's crucial to comprehend their capabilities and constraints. For instance, these tools may lack the ability to monitor schema changes.

Consider investing a bit more for enhanced functionality, like longer historical data retention. This investment can prove invaluable for gaining deeper insights into the database's performance and behavior.

Deploying a comprehensive monitoring solution in advance is particularly recommended, especially for self-managed PostgreSQL environments. By doing so, we ensure that we're well-prepared to tackle any performance issues or unexpected changes, ultimately maintaining the health and efficiency of your database.

Seeing How Queries Work: Making Complex Plans Simple for Better Performance

At the heart of query optimization lies a powerful tool: the execution plan. This tool provides a detailed roadmap for how a database engine will execute a specific query. By scrutinizing this plan, we can unlock the mystery behind why a query might be running slower than expected.

Consider a query like SELECT * FROM customer_details WHERE customer_id = $1 ORDER BY last_update DESC The execution plan for this query delves beyond the surface. We might not know the number of rows scanned to produce the output or whether the database engine utilized an index. The impact of the ORDER BY clause also remains obscure — did it require temporary files for sorting?

Some monitoring tools require configuring the database in advance to collect the execution plans for further analysis. That is often the case with postgresql troubleshooting, but may apply to other databases as well.

Execution plans can be tricky to understand because they're packed with technical terms and detailed steps about how databases do their work. But reading the execution plan doesn’t have to be complicated.

There are two free tools for visualizing and analyzing execution plans. With Postgres we should use Metis Query Analyzer and with MySQL we should use explainmysql.

Query Optimization Flow

The practical flow we can follow is:

Review the execution plan, focus on the steps with the longest duration, read the most rows or have the highest “cost”. Usually there is a strong correlation between the three. So the part that takes the most time also has a high cost and reads many rows.
How many rows the query reads?
Is it possible to minimize the number of rows read using indexes?
After adding indexes, does the database engine use them? Did that solve the problem?
How many tables are joined? Are they all really necessary?
Which columns are used by the query? Are they all needed? Focus on the columns with large data types, such as long texts, images, JSON and XML. Is it possible to return the large data in a separate query?
Review the join methods of the tables. Sometimes the datatypes of the joined columns are not exactly the same, causing data type casting, making indexes useless, which in turn, lead to table scans.
Review the sorting. Sorting many rows might be a very “expensive” step, especially when sorting many rows. In some cases, the engine needs to generate temporary files to store the large sorted dataset.
Review the amount of data returned by the query. A query might quickly find all the rows but sending them over the network might take a long time. Consider sending only the first batch of the rows and returning more upon request (pagination).

There are other insights we can check depending on the query we discuss.For instance, verifying type of an index for queries using JSONB, or training step for indexing with pgvector. Consult Metis to see what other insights are worth checking:

Dead-locks

In PostgreSQL, deadlocks are a scenario where two or more transactions are stuck in a circular dependency, each waiting for a resource that the other holds. This situation halts the progress of these transactions, causing them to become "deadlocked," as none of them can proceed without the release of the resource by another transaction. This deadlock state leads to a standstill, effectively preventing any further progress or completion of the affected transactions.

Deadlocks can occur due to various reasons, such as when transactions acquire and request locks on resources like database tables or rows in a conflicting order. If the transactions cannot proceed because they're both waiting for the other's held resource to be released, a deadlock ensues.

To mitigate deadlocks, PostgreSQL employs a deadlock detection mechanism. When a deadlock is detected, the database system identifies one transaction as the victim and rolls back its changes, allowing the other transaction to continue. While deadlocks cannot be entirely eliminated, careful transaction design, lock management, and monitoring can help minimize their occurrence and impact.

Databases and Memory Usage

In the world of databases, the term "memory" can mean different things. But mostly, we're talking about something called the "Buffer Cache." This is like keeping frequently used parts of tables and indexes in a special memory space. It helps make getting data faster because the stuff we need often is already waiting in this memory space.

Think of the database as a big collection of information. Sometimes, the database is bigger than the memory the computer uses. Imagine the memory like a special space where the database can keep some of its important things for quick access. But the memory isn't only for the database – other things also need memory, like the operating system and different services.

So, the database has to be careful and not use up all the memory. It needs to save some space for new things it wants to look at. These new things are like pages from a book, and they're called "buffers."

This is a normal thing for the database to do. It's like a cycle: it puts some stuff in memory, uses it, and then makes space for new stuff. Now, if there's more memory, it's like having more space to keep important pages from the book. This way, the database doesn't have to ask the computer's disk for these pages as often. This makes things go faster!

Cache Hit Ratio

Cache hit ratio is a way to measure how often the database finds the data it needs in its memory cache, rather than having to fetch it from the slower disk storage. It's like checking how often the database quickly gets information it already has in its memory.

A high cache hit ratio means that most of the time, the database is finding the data it needs in its memory cache. This is great for performance because fetching data from memory is much faster than getting it from the disk. On the other hand, a low cache hit ratio suggests that the database is often having to look for data on the disk, which can slow things down.

Connections Memory

In PostgreSQL, having too many connections open at once can use up a lot of memory. To prevent this, we use something called a connection pool. It helps control the number of connections and tame the memory management in databases.

When the server has around 500+ connections, it's a good idea to use a connection pool such as pgBouncer. Some DBA will argue that even as little as 200 connections require a connection pool. Since this is a known problem, some PostgreSQL as a service providers offer built-in connection pools. Others, such as AWS RDS, ask us to configure one ourselves.

Practical advice:

Monitor the configured size of the buffer cache.
Monitor the buffer cache, to check how efficient the queries are from the memory utilization perspective.
Use SSD hard drives, to make calling the disk faster. This is always true but very Important for low cache hit ratio databases.
Monitor the temp files usage, to detect the impact of queries who sort a large amount of data and the allocated memory is not big enough.
Since every new connection allocated memory (work memory) it is recommended to start using a connection pool when the server handles hundreds of connections.

Database and CPU utilization

Checking how much the CPU is used in a PostgreSQL database is crucial. When the CPU usage for PostgreSQL is really high, we need to figure out why. Maybe the queries are not written well, happening too much, or using a lot of power. It's important to solve this so that the database can work better.

Analyze Database Activity

It's a good idea to look at what each database is up to so we can figure out which ones are super busy. Then, we can decide if their activity looks usual or not. Sometimes, lots of little queries can end up using a bunch of CPU power. On other occasions, a few specific queries might be the ones hogging most of the CPU.

Start with understanding how many CPUs the DB server has. Remember that in the cloud the hardware might be shared, therefore, to be more precise, we are talking about virtual CPUs. In simple English, you might discover that you need to pay for many CPUs to get good performance.
Use dashboards for monitoring the overall CPU utilization of the server. Usually this is a built-in functionality of your cloud provider such as AWS CloudWatch. Else, the DevOps probably deployed other monitoring solutions such as Prometheus.
Remember that a database server on 50%-65% utilization is not necessarily a bad thing. It means the server is working hard and yet has enough resources for spikes.
Search for spikes. Some systems need to address short periods of high demands. In those periods of high activity (spikes) the CPU, memory and IO are under pressure. That alone is not necessarily a problem, as long as it doesn't affect the users. If the response time of the DB becomes slow, then probably more hardware is needed to handle the demand. Since the organization won’t pay for an expensive hardware just for those rare spikes, some cloud vendors offer plans that allow fast growing and then scaling down. The database “bursts” and then goes back to its regular resources specification.
As a developer, or an app owner, we are not expected to monitor the CPU utilization. However, we should also get alerts, with some context, when the CPU is high for long minutes or there is an impact of a new release on the CPU.
We can check which database uses the most CPU, memory, and IO. For ex. a high CPU on a database might be caused by running many long queries, which in turn read a lot of data from the disk.

Analyze the queries which consume most CPU

Every database has system tables to show which are the SQL commands which consume the most CPU. To detect anomalies, we usually need to distinguish between the top CPU consumers now and the historical CPU usage.

Once we detect the top CPU consumers, the next step is looking into the execution plan to understand why exactly the query requires so much CPU. Maybe it iterates over a large number of rows, or calls other functions for every row. Analyzing the execution plan, as explained above, can show the exact root cause of the problem.

Quick Solution - Scale Up

The obvious solution is to increase the number of CPUs. It is a relatively simple solution when a cloud RDS is used. However, do not apply this solution automatically, without understanding the root cause. For example, if the database is in the middle of building indexes, which might consume high CPU, then increasing the number of CPUs might have a negative impact. Stopping internal index creation might rollback the process, which in turn might lock the tables, making them inaccessible until the rollback completes.

Quick Solution - Scale Out Using a Read-Only Replica

Another quick solution is to create a read-only replica of the database. All main cloud vendors support this functionality, this is the easy part. The more complicated step is to redirect the queries to the read only replica.

If possible, redirect long running analytical queries on the read-only replica instead of the main read-write one.

Summary

Slow-running queries, increased CPU usage, and memory pollution can lead to decreased performance. Troubleshooting the database issues requires a holistic view of the database, incoming changes, data access patterns, and ongoing scheduled tasks. Taming all these challenges is much easier with a proper guide like this one.

FAQs

What are the most common database performance issues and how to resolve them?

The most common issues around database performance are related to a high CPU usage, excessive memory usage, slow queries, time-consuming scheduled tasks, and connectivity issues. Resolving typically takes understanding of the whole ecosystem, finding the root cause, and solving the ultimate problem.

What are the best practices to manage CPU usage in databases?

We should always look for the root cause. High CPU usage may be caused by too slow hardware, bloated memory, unexpected background tasks, fragmentation, or many other reasons. Just addressing the CPU usage is not enough.

How can I ensure efficient memory management in my database system?

Make sure you use the proper indexing, have buffers configured properly, do not extract unnecessary data, and keep a high cache hit ratio.

How do I troubleshoot PostgreSQL performance problems?

Analyze the database activity, find the source of spiking metrics, pinpoint it to a query or a task, and measure the performance before and after.

How to monitor CPU and memory usage in a database environment?

Use various dashboards: infrastructure dashboard provided by the infrastructure, application dashboard showing actual operations, and proper observability dashboard, such as one provided by Metis.