Database Testing: Why It's Important & How To Do It Right

Today’s industry standard for developing and maintaining databases is less than ideal. We don’t have a good way of preventing the bad code from reaching production. Once we deploy, we lack good tooling for observability and monitoring. Furthermore, we don’t have a way of troubleshooting and resolving issues reliably and automatically. Developers often don’t know how to proceed with their databases, and also often don’t own the solutions.

Furthermore, current approaches for testing applications do not tend to capture issues around databases. They don't find slow queries (that work fine in the local environment but are too slow in production), long migrations (that take production databases down for minutes or hours), or invalid configurations (like indexes not used effectively in production). Verifying these things can be slow and expensive as they require heavy load tests.

We need something new. We need a paradigm shift and a completely novel approach: database guardrails. We need a solution based on OpenTelemetry and runtime analysis that will push all these checks to the left. We need to provide semantic observability and monitoring instead of raw metrics, and aggregate signals from the entire software development life cycle (SDLC) loop to clearly present the necessary context to ease troubleshooting.

Introduction

Cloud and microservices have changed the operations world. A decade ago, most applications used one database, developers knew how it worked, and deployments happened rarely. A single database administrator was capable of maintaining the database, optimizing the queries, and making sure things worked as expected. When there was a problem with the performance, the database administrator could just step in and fix the issue. Developers didn’t need to understand the database, and even if they owned it, it was just a single piece. Testing was much easier because deployments happened rarely and things could be captured on time.

Now, everything is different. Organizations can have hundreds of applications, each one with its own database. Deployments happen every hour, CI/CD pipelines are working continuously, and keeping track of flowing changes is beyond any one individual's capabilities. The complexity of the software has increased significantly. In some cases, applications don’t talk to databases directly but instead use complex libraries that generate and translate queries on the fly. Application monitoring is much harder because applications do not work in isolation and each change may cause multiple other applications to fail.

Databases have changed as well. We have various SQL distributions, often incompatible with each other. We have NoSQL databases that provide different consistency guarantees and optimize their performance for various use cases. We developed multiple new techniques and patterns for structuring our data, processing it, and optimizing schemas and indexes. It’s not enough now to just learn one database, developers need to understand various systems and be proficient with their implementation details. This increases the conceptual load significantly.

Database administrators have a much harder time keeping up with the changes and they don’t have enough time to improve every database. Developers are unable to analyze and get the full picture of all the moving parts, but they need to deploy changes faster than ever. And the monitoring tools still swamp us with metrics instead of answers. Given all the complexity, we need developers to own their databases and be responsible for their data storage.

This “shift left” in responsibility is a must in today’s world, for both small startups and big Fortune 500 enterprises. However, it’s not trivial. How to best prevent the bad code from reaching production? How do we move from monitoring to observability? How do we efficiently troubleshoot issues automatically? Finally, how do we give developers the proper tools and processes so they will be able to own the databases?

Microservices, Databases, Observability, and Testing Everything

We need to thoroughly test our applications to find issues before deploying to production. This is a must for every product as letting bugs slip through the cracks is expensive because fixing these issues takes more time and money. Bugs make the application slow or unresponsive and may lead to outages that hurt the brand, increase customers’ frustration, and simply decrease revenue. Software engineers change their context and forget what they were working on weeks before the deployment. What’s more, bugs may lead to data inconsistencies or side effects in external or third-party systems that are even more expensive to fix.

However, it’s hard to build the right set of tests. Tests often focus too much on the business logic and not on the changes that affect databases (and data storage in general). They verify if we get the right result, but they completely ignore how we get it. And it’s not about asserting that a particular line of code was executed or not, but much more about whether our assumptions about non-functional requirements are still met. These tests disregard database changes that happen outside of the SDLC activities that programmers are exposed to.

To improve this situation, we need to take three steps. First, we need to prevent the bad changes from reaching production. We need to have a new way of identifying things that will not scale well, that may take the database down, or that may break the data quality or consistency. In a similar fashion to the shift towards DevOps and letting the developers own the deployment process, we need to let developers own their databases. However, we can’t just put the burden on them. We need to invent new approaches, build new tools, and rework our processes to make it bearable.

Second, we need to build better observability. Developers need to be involved in maintaining their databases even after the deployment because things may break at any time. If we build a video streaming service, then things that we deploy on Wednesday will tend to see peak usage on Saturday afternoon. Same for logistic applications – we will see the results of our weekend deployments only during weekdays’ peak hours. Database management doesn’t end when the deployment is done. This is a continuous story that developers need to be involved in. And to do that they need proper tools.

They don’t need metrics that are not actionable. They need very specific insights into what has happened and how to fix that. Current monitoring tools lack these insights and are not enough. They show raw metrics that are hard to untangle, especially if we run in a multi-tenant or virtualized environment when multiple services or databases are running on a single host. The tools typically just show generic metrics about CPU, memory consumption, or network traffic, but they can’t provide a good explanation of how things interoperate and whether the anomaly in metrics comes from the database, operating system, or some other tenant reusing the same host.

Third, troubleshooting must be much easier. We need systems that connect all the dots for us, provide context, and suggest improvements automatically. These tools need to be aware of the code changes, deployment times, configuration changes, and how things are interconnected. When the issue appears, the tools must be able to tell the full story, and not just show a bunch of metrics that the developer will need to understand.

All these three elements make successful database guardrails – a new approach for developers to own their databases. Let’s see what exactly may go wrong with databases to understand where our current methods fail.

Databases and What Can Go Wrong

Multiple things may break around databases. We are going to see the three big areas where things may go wrong: code changes, schema changes, and execution changes. Finally, we are going to see how our current tools make things even worse.

Code Changes

Many database issues come from the code changes. Developers modify the application code and that results in different SQL statements being sent to the database. These queries may be inherently slow, but these often won’t be captured by the SQL testing processes we have in place now.

Let’s start with the following example. Imagine that we have the following application code that extracts the user aggregate root. The user may have multiple additional pieces of information associated with them, like details, pages, or texts:

const user = repository.get("user")
  .where("user.id = 123")
  .leftJoin("user.details", "user_details_table")
  .leftJoin("user.pages", "pages_table")
  .leftJoin("user.texts", "texts_table")
  .leftJoin("user.questions", "questions_table")
  .leftJoin("user.reports", "reports_table")
  .leftJoin("user.location", "location_table")
  .leftJoin("user.peers", "peers_table")
  .getOne()
return user;

This is actually a code that I was once debugging in one of my production systems. The code generates the following SQL statement:

SELECT *
FROM users AS user
LEFT JOIN user_details_table AS detail ON detail.user_id = user.id
LEFT JOIN pages_table AS page ON page.user_id = user.id
LEFT JOIN texts_table AS text ON text.user_id = user.id
LEFT JOIN questions_table AS question ON question.user_id = user.id
LEFT JOIN reports_table AS report ON report.user_id = user.id
LEFT JOIN locations_table AS location ON location.user_id = user.id
LEFT JOIN peers_table AS peer ON Peer.user_id = user.id
WHERE user.id = '123'

Because of multiple joins, the query returns nearly three hundred thousand rows to the application that are later processed by the mapper library. This takes 25 seconds in total. Just to get one user entity.

However, the problem with such a query is that we don’t see the performance implications when we write the code. If we have a small developer database with only a hundred rows, then we won’t get any performance issues when running the code above locally. Unit tests won’t catch that either because the code is “correct” – it returns the expected result. We won’t see the issue until we deploy to production and see that the query is just too slow.

Another problem is the well-known N+1 query problem with Object Relational Mapper (ORM) libraries. Imagine that we have table flights that is in a one-to-many relationship with table tickets. If we write code to get all the flights and count all the tickets, we may end up with the following:

var totalTickets = 0;
var flights = dao.getFlights();
foreach(var flight in flights){
  totalTickets + flight.getTickets().count;
}

This may result in N+1 queries being sent in total. One query to get all the flights, and then n queries to get tickets for every flight:

SELECT * FROM flights;
SELECT * FROM tickets WHERE ticket.flight_id = 1;
SELECT * FROM tickets WHERE ticket.flight_id = 2;
SELECT * FROM tickets WHERE ticket.flight_id = 3;
...
SELECT * FROM tickets WHERE ticket.flight_id = n;

Just as before, we don’t see the problem when running things locally and our tests won’t catch that. We’ll find the problem only when we deploy to an environment with a sufficiently big data set.

Another issue can arise when rewriting queries to make them more readable. Let’s say that we have a table boarding_passes. We want to write the following query (just for exemplary purposes):

SELECT COUNT(*)
FROM boarding_passes AS C1
JOIN boarding_passes AS C2 ON C2.ticket_no = C1.ticket_no AND C2.flight_id = C1.flight_id AND C2.boarding_no = C1.boarding_no
JOIN boarding_passes AS C3 ON C3.ticket_no = C1.ticket_no AND C3.flight_id = C1.flight_id AND C3.boarding_no = C1.boarding_no
WHERE
  MD5(MD5(C1.ticket_no)) = '525ac610982920ef37b34aa56a45cd06'
  AND MD5(MD5(C2.ticket_no)) = '525ac610982920ef37b34aa56a45cd06'
  AND MD5(MD5(C3.ticket_no)) = '525ac610982920ef37b34aa56a45cd06'

This query joins the table with itself three times, calculates the MD5 hash of the ticket number twice, and then filters rows based on the condition. This code runs for eight seconds on my machine with this sample database.

A programmer may now want to avoid this repetition and rewrite the query to the following:

WITH cte AS (
  SELECT *, MD5(MD5(ticket_no)) AS double_hash
  FROM boarding_passes
)
SELECT COUNT(*)
FROM cte AS C1
JOIN cte AS C2 ON C2.ticket_no = C1.ticket_no AND C2.flight_id = C1.flight_id AND C2.boarding_no = C1.boarding_no
JOIN cte AS C3 ON C3.ticket_no = C1.ticket_no AND C3.flight_id = C1.flight_id AND C3.boarding_no = C1.boarding_no
WHERE
  C1.double_hash = '525ac610982920ef37b34aa56a45cd06'
  AND C2.double_hash = '525ac610982920ef37b34aa56a45cd06'
  AND C3.double_has = '525ac610982920ef37b34aa56a45cd06'

The query is now more readable as it avoids repetition. However, the performance dropped and the query now executes in 13 seconds. Since the query returns the same result as the one before, no automated tests will find an issue. We won’t catch the performance problems with our typical set of unit tests or integration tests.

Schema Changes

Another issue is schema management. There are generally three different ways of modifying the schema: we can add something (table, column index, etc.), remove something, or modify something. Let’s see how these things may cause trouble.

Recommended reading: Common Challenges in Schema Migration & How To Overcome Them

Adding a column seems to be safe. We just add a new column to a table, test all the queries, and typically nothing breaks. However, adding a column may take a lot of time. The database engine may need to rewrite the table – copy the data on the side, modify the table schema, and then copy the data back.

This may happen when we add a column between other columns and there is a clustered index on the table, or when we add a very wide column and we don’t have enough room in the pages. Our typical database performance testing suite will not find the issues because the schema modification runs outside of the tests. Also, when we modify the schema locally, we typically don’t see any performance issues because our local databases are small. However, rewriting a table may take minutes or even hours when done in production with sufficiently many rows. Deploying such a change may even take the application down. Even worse, we may not be able to stop the deployment when we realize that it’s taking too long because of the locks already in place on the table.

Adding an index also seems to be safe. However, they need to be maintained over time. Indexes generally improve the read performance because they help us find rows much faster. At the same time, they decrease the modification performance – every INSERT, UPDATE, or DELETE must now remove data in yet another place. This may lead to performance degradation. Just like before, this is not something that we’ll be able to capture with our test suite easily. Our unit tests will still be green because we don’t change the correctness of the algorithms, only their performance.

Removing is dangerous because of similar reasons. However, it can also lead to a disaster when we do a rolling deployment or when we run a heterogeneous system. When we do a rolling deployment, some nodes are already running the next version of the code, while others still run the previous one. Since the previous version may still rely on the thing we removed (index, column, table, etc.), it may start failing. To avoid that, we need to maintain backwards compatibility between version updates. However, our tests typically run against only one version of the code. They take the latest code (after changes) and verify it, so they won’t capture the issues with compatibility.

A similar case happens when we run a heterogeneous system. We need to update all the applications running in our server farm, and only then can we remove things from the database. Since our unit tests are typically testing just one application, it is much harder to identify issues or lack of compatibility.

In addition, that removal may lead to data loss. Any change that results in dropping something from the database needs to be carefully reviewed to make sure that nothing is lost. Restoring the database from a backup often takes hours and nearly always leads to a system outage.

Finally, modifying the schema is also dangerous. If we change the column type, then we may get wrong results or decreased performance. In some cases, our unit tests will catch this, but often these issues are missed. Imagine that you change the collation of a column (way of ordering characters in the string). This is a normal setting for our database, however, most of our programming languages can’t reflect that easily. We typically have just one string type in our language, and we can’t represent various requirements regarding comparing lowercase characters with uppercase ones or the order of national characters in the alphabet. This may be also very difficult to spot if we use non-production components in our tests, like an H2 database instead of the production-grade one. Changes in collation may simply go unnoticed.

Execution changes

Yet another area of issues is related to the way we execute queries. Databases prepare a so-called execution plan of the query. Whenever a statement is sent to the database, the engine analyzes indexes, data distribution, and statistics of the tables’ content to figure out the fastest way of running the query. Such an execution plan heavily depends on the content of our database and running configuration. The execution plan dictates what join strategy to use when joining tables (nested loop join, merge join, hash join, or maybe something else), which indexes to scan (or tables instead), and when to sort and materialize the results.

We can affect the execution plan by providing query hints. Inside our SQL statements, we can specify what join strategy to use or what locks to acquire. The database may use these hints to improve the performance, but may also disregard them and execute things differently. However, we may not know whether the database used them or not.

Things get worse over time. Indexes may change after the deployment, data distribution may depend on the day of the week, and the database load may be much different between countries when we regionalize our application. Query hints that we provided half a year ago may not be relevant anymore, but our tests won’t catch that. Unit tests are used to verify the correctness of our queries, and the queries may still return exactly the same results even when not performing optimally in production. We have simply no way of identifying these changes automatically.

ORMs and the Lack of Context

Another source of trouble comes from the tools we use. Developers may use some kind of ORM to run their queries against the database. However, ORMs hide lots of details from us.

We already mentioned the N+1 problem. One way of fixing it is to change the ORM to eagerly load the entities. However, based on the application code that we run, we have no way of telling whether the N+1 problem will appear or not. This is typically hidden somewhere in the ORM configuration. What’s worse, just changing the configuration doesn’t solve the problem. Just like with the example we saw earlier that resulted in sending three hundred thousand rows to the application, some queries may require eager loading, some may need lazy loading, and some others may even need to be rewritten entirely. But we won’t identify that just by looking at the application code in unit tests or during the code review.

Another problem can come up around transactions. Do you know what your default transaction isolation level is? Do you know where it is configured? Or do you even know if you can nest transactions and who decided whether to commit them or roll back? Those are questions that ORM needs to answer, and the answers are typically hidden from us and scattered around the code base.

Some ORMs may also modify the database schema, especially when they find a drift. This may not be a problem if the given ORM is the only user of the database, but may be much worse when we run multiple applications or have a heterogeneous cluster. Each ORM may have its own way of tracking the schema, some store the data in an additional table in the database, some others just run plain SQL scripts. Maintaining these things is something that we typically don’t test with unit or integration tests. We just apply the schema changes.

There are many more issues that ORM can cause. Caching, connection pooling, transaction scope, unsupported data types, and polymorphic execution with virtual functions are just examples. ORMs are great, but they also cause a lot of headaches.

The Testing Pyramid and Why It’s Not Enough

Until now we discussed things that may break when working with databases. Our goal is to prevent the bad changes from reaching production. Let’s see why our current approach won’t work.

Let’s examine the test pyramid. Martin Fowler explains that it is a metaphor that tells us to group software tests into buckets of different granularity. His article goes into detail about building proper tests and where we should focus the most. Let’s see the pyramid and why it’s not enough for our needs.

Unit Tests

Martin Fowler describes that unit tests make sure that a certain unit (your subject under test) of your codebase works as intended. Unit tests are supposed to verify the correctness of the application, check the public interfaces, and make sure that the business logic does what we expect it to do. We may argue if we should test trivial code, mock dependencies, or test private methods. However, no matter what our approach is, unit tests do not verify how things execute. They won’t capture issues around N+1 queries, won’t verify if the database uses indexes, and won’t check when the database engine decided to join tables differently. First, unit tests don’t even access the database as we typically mock it out. Second, even if we use an actual database in the unit tests, then we don’t assert how it executes the operations. Unit tests won’t protect our databases.

Integration Tests

Martin Fowler then moves to integration tests that test the integration of your application with all the parts that live outside of your application. They are supposed to verify if our application interoperates well with infrastructure around: web servers, load balancers, and databases. There are multiple issues, though.

First, integration tests don’t focus on how things execute. With integration tests we take the same approach as with the unit tests - we load the data, run queries, and verify if we got the expected entities or results in general. Integration tests check if our contracts between components are correct, not if the components work properly. To test the latter, we use unit tests much more. Integration tests simply have different goals.

Second, integration tests use different databases than the production ones. We typically go with either of two approaches: we create new infrastructure or reuse existing one. When we recreate the infrastructure from scratch using Infrastructure as Code (IaC), we spin up a new database cluster, fill it with some data, and then run our queries. While it sounds good enough, such a process doesn’t prepare the database the same way we run it in production. We most likely won’t have enough data in the database (because loading the data takes much time), statistics won’t reflect the production scenarios, and we won’t have the same extensions, configuration, or even database edition. Even Martin Fowler shows how he uses an H2 database that is unfortunately not close to the actual components he uses in production. The other approach could be to reuse some existing database, that is probably prepared for the team environment and reused by multiple developers. This may be even worse because the testing database may drift from the production settings and make our experience different. We probably don’t want to pay as much for the testing database as for the production one, so the testing database will have different hardware or editions. It will be often filled and cleared up, so the statistics will be much different. We may use different runtime configurations or extensions. We will have different partitions and the schema will often differ.

Third, integration tests do not react to changing production environments. They will test our code with some configuration obtained at a given point in time. However, if we later reconfigure our database, we most likely won’t rerun the tests with the new configuration. Therefore, changes in the production database won’t be reflected in the integration tests when these changes are applied. They will be taken into account when a new change is being tested, but that may be very late.

Fourth, integration tests assume the database is there. The database is created somehow, and the test uses the database to verify correctness. However, the test doesn’t check how the database was created, whether the migration was fast enough, or whether the configuration is the one that we expect. If the database works, then the test will just work as well.

Other Tests

Martin Fowler mentions other test types in the pyramid. They can focus on contracts, UI, and interoperation of all the elements. These tests run on different levels, but they all test with the same approach - prepare some input, run the test, and verify if the output is correct. These tests check if things work, but they don’t focus on how things work.

Interestingly enough, Martin Fowler doesn’t focus much on load tests. They are actually the only tests that may effectively capture at least some of the issues we mentioned before.

Why Load Tests are Not the Right Tool

Load tests are supposed to test the performance of our application. They seem to be exactly what we need to avoid the issues we listed before. However, they are not good enough either. Let’s see why.

Drawbacks of Load Tests

Load tests are expensive. To run a load test for a big web application we may need to spin up a fleet for hours. Depending on the type of our application, we may need to configure other services like caches, databases, load balancers, credential managers, and others. Paying additionally for yet another fleet of services to run load tests may be something we are not willing to pay for. This may be even harder for ML-enabled applications that utilize multiple GPU instances, or applications using some custom hardware. This may be just too expensive.

Load tests are slow. Running a load test can easily take 12 hours or more. Developers typically let these tests run overnight and check the results the following morning. This forces a context switch and a much slower feedback loop. If we need to run multiple load tests before production (because of syntax errors, wrong configuration, or whatever other reason), then this approach may slow our deployments for a week easily. This also increases frustration among developers when they need to spend much time doing the same thing over and over again.

Load tests may not reflect the actual data distribution. If we deploy the application in the US but test it using data from Europe, then the results we get won’t be reliable. Similarly, if we use testing data from a big country, then we notice performance issues with small countries contributing a single-digit percent of traffic to the platform. Getting the right data distribution is hard, and maintaining that data is even harder. While we can capture a proper dataset once, it’s difficult to maintain it over time. Sometimes we can just replay the production traffic directly to the non-production environment, but it’s not always possible if the application is stateful or due to legal implications.

Load tests need to use data safely. We can’t just take the production requests and run them locally because they contain confidential data. We need to adhere to GDPR, CCPA, and other local regulations. We need to redact the secret values, sometimes change distribution, and sometimes even remove some requests entirely. We need to pay attention to logs, databases, cache keys, configuration keys, and other places where confidential information may be stored. This is effectively yet another system that we need to maintain which takes time and is expensive.

However, the most important drawback of load tests is that they happen way too late. Most of the time developers run them after they merge their code to the main branch and run it through the pipeline. Developers then move to different tasks and focus on a new assignment. Load tests take hours to complete and developers get the feedback days after they merge the code. The code was already reviewed, verified, and tested, and now developers need to rewrite it or even start from scratch because the approach they took just won’t work. This hurts velocity and developer productivity significantly, and we can’t let that happen. This feedback loop must be much shorter; ideally, developers should get feedback when they write the code.

Which Issues Will Not be Captured

Even if we run load tests, we still won’t capture all the issues. Load tests run against an already deployed application. They don’t check the deployment per se. Therefore, slow schema migrations, changes in configuration, or even the deployments themselves won’t be tested by load tests. Developers may test these things explicitly, however, our automated pipelines typically don’t verify them at all. Tests will fail if the migration fails, and they will carry on if the migration succeeds. However, tests will not check how long it took to run the migration, add an index, or change the configuration.

Database Guardrails - The Ultimate Solution

Based on what we said above, we need a new approach. Whether we run a small product or a big Fortune 500 company, we need a novel way of dealing with our databases. Developers need to own their databases and have all the means to do it well. We need database guardrails - a novel approach that:

Prevents the bad code from reaching production,
Monitors all moving pieces to build a meaningful context for the developer,
And significantly reduces the time to identify the root cause and troubleshoot the issues, so the developer gets direct and actionable insights

We can’t let ourselves go blind anymore. We need to have tools and systems that will help us change the way we interact with databases, avoid performance issues, and troubleshoot problems as soon as they appear in production. Let’s see how we can build such a system.

What We Need for Database Guardrails

There are three things that we need to capture to build successful database guardrails. Let’s walk through them.

Database Internals

Each database provides enough details about the way it executes the query. These details are typically captured in the execution plan that explains what join strategies were used, which tables and indexes were scanned, or what data was sorted.

To get the execution plan, we can typically use the EXPLAIN keyword. For instance, if we take the following PostgreSQL query:

SELECT TB.*
FROM name_basics AS NB
JOIN title_principals AS TP ON TP.nconst = NB.nconst
JOIN title_basics AS TB ON TB.tconst = TP.tconst
WHERE NB.nconst = 'nm00001'

We can add EXPLAIN to get the following query:

EXPLAIN
SELECT TB.*
FROM name_basics AS NB
JOIN title_principals AS TP ON TP.nconst = NB.nconst
JOIN title_basics AS TB ON TB.tconst = TP.tconst
WHERE NB.nconst = 'nm00001'

The query returns the following output:

Nested Loop  (cost=1.44..4075.42 rows=480 width=89)
  ->  Nested Loop  (cost=1.00..30.22 rows=480 width=10)
    	->  Index Only Scan using name_basics_pkey on name_basics nb  (cost=0.43..4.45 rows=1 width=10)
          	Index Cond: (nconst = 'nm00001'::text)
    	->  Index Only Scan using title_principals_nconst_idx on title_principals tp  (cost=0.56..20.96 rows=480 width=20)
          	Index Cond: (nconst = 'nm00001'::text)
  ->  Index Scan using title_basics_pkey on title_basics tb  (cost=0.43..8.43 rows=1 width=89)
    	Index Cond: (tconst = tp.tconst)

This gives a textual representation of the query and how it will be executed. We can see important information about the join strategy (Nested Loop in this case), tables and indexes used (Index Only Scan for name_basics_pkey, or Index Scan for title_basics_pkey), and the cost of each operation. Cost is an arbitrary number indicating how hard it is to execute the operation. We shouldn’t draw any conclusions from the numbers per se, but we can compare various plans based on the cost and choose the cheapest one.

Such an execution plan explains everything that the database is going to do. It has various options, some of which are:

Format - we can get the plan in textual form or some well-structured format like JSON or XML
WAL - how much data is added to the Write-Ahead-Log to maintain transactions properly
BUFFERS - which memory buffers were used and to what extent

However, the most important is that the execution plan can be either estimated or actual. An estimated plan is a plan that explains how the database plans to execute the query. It estimates the number of rows, pages, used indexes, and other details. However, during the actual execution, the engine may decide that some estimates were inaccurate (like the number of rows filtered out) and the engine needs to pivot to some other operation. The actual plan represents the operation that has been actually executed! The actual plan clearly shows what the database did.

Having plans at hand, we can easily tell what’s going on. We can see if we have a N+1 query issue, if we use indexes efficiently, or if the operation runs fast. We can get some insights into how to improve the queries. We can immediately tell if a query is going to scale well in production just by looking at how it reads the data.

Once we have these plans, we can move on to another part of successful database guardrails.

Integration with Applications

We need to extract plans somehow and correlate them with what our application actually does. To do that, we can use OpenTelemetry (OTel). OpenTelemetry is an open standard for instrumenting applications. It provides multiple SDKs for various programming languages and is now commonly used in frameworks and libraries for HTTP, SQL, ORM, and other application layers.

OpenTelemetry captures signals: logs, traces, and metrics. They are later captured into spans and traces that represent the communication between services and timings of operations.

‍

‍

Each span represents one operation performed by some server. This could be file access, a database query, or request handling.

We can now extend OpenTelemetry signals with details from databases. We can extract execution plans, correlate them with signals from other layers, and build a full understanding of what happened behind the scenes. For instance, we would clearly see the N+1 problem just by looking at the number of spans. We could immediately identify schema migrations that are too slow or operations that will take the database down.

Now, we need the last piece to capture the full picture.

Semantic Monitoring of All Databases

Observing just the local database may not be enough. The same query may execute differently depending on the configuration or the freshness of statistics. Therefore, we need to integrate monitoring with all the databases we have, especially with the production ones.

By extracting statistics, number of rows, running configuration, or installed extensions, we can get an understanding of how the database performs. Next, we can integrate that with the queries we run locally. We take the query that we captured in the local environment, and then reason about how it would execute in production. We can compare the execution plan, and see which tables are accessed, or how many rows are being read. This way we can immediately tell the developer that the query is not going to scale well in production. Even if the developer has a different database locally or has a low number of rows, we can still take the query or the execution plan, enrich it with the production statistics, and reason about the performance after the deployment. We don’t need to wait for the deployment or the load tests, but we can provide feedback nearly immediately.

The most important part is that we move from raw signals to reasoning. We don’t swamp the user with plots or metrics that are hard to understand, or that the user can’t use easily without setting the right thresholds. Instead, we can provide meaningful suggestions. Instead of saying “the CPU spiked to 80%” we can say “the query scanned the whole table and you should add an index on this and that column”. We can give developers answers, not only the data points to reason about.

Automated Troubleshooting

That’s just the beginning. Once we understand what is actually happening in the database the sky's the limit. We can run anomaly detection on the queries to see how they change over time if they use the same indexes as before, or if they changed the join strategy. We can catch ORM configuration changes that lead to multiple SQL queries being sent for a particular REST API. We can submit automated pull requests to tune the configuration. We can correlate the application code with the SQL query, so we can rewrite the code on the fly with machine-learning solutions.

Database guardrails provide the actual answers and actionable insights. Not a set of metrics that the developer needs to track and understand, but automated reasoning connecting all the dots. That’s the new approach we need, and the new age we deserve as developers owning our databases.

How Metis stays ahead of Database Guardrails movement

Metis is your ultimate database guardrails solution. It prevents the bad code from reaching production, monitors all your databases, and automatically troubleshoots the issues for you.

How it works

Metis is a Software as a Service (SaaS) that integrates with your application and database, analyzes queries and schemas, and provides insights into how to improve your ecosystem.

To use Metis, we need to discuss how Metis SDK works.

An SDK is just a library that wraps OpenTelemetry and adds another sink that delivers spans and traces to Metis. It plugs into the regular OpenTelemetry features you most likely already have in your application. Many frameworks and libraries, including web, ORM, SQL drivers, already integrate with OpenTelemery and emit appropriate signals. Metis reuses the same infrastructure to capture metadata about interactions, requests, and SQL queries. You can read more about that in the documentation .

SDK is by default disabled in production to not decrease the performance of your application, and to avoid extracting confidential information. To install the SDK, one just needs to add a single dependency to their application with the ordinary package manager they use every day (like NPM or PIP).

Let’s now see what Metis can do for you.

Prevent the bad code from reaching production

Metis SDK captures interactions and visualizes them. Let’s see the main page of the project analysis:

Project page consists of application activity (middle), pull requests (left), and live databases (right). Let’s start with a sample application activity:

Metis captures requests coming to the application and immediately shows insights (column on the right). If we enter a particular activity, we get this:

Top part (number 1) shows the visualization of the interaction. We can see that the query took 13 milliseconds in total, and that there were two SQL queries sent along the way. First query (taking 5 milliseconds) is selected, and we can see (number 2) how many roads were read and returned in that query. This gives an understanding of what was going on.

The most important part is on the left (number 3). Developers should get immediate information whether the code can safely work in production (and all insights are green) or if there is something wrong (here indicated by red and yellow insights). Red insights are critical and they should be fixed before going to production, otherwise we risk a significant performance degradation.

Central part of the page shows details of the insight. We have practical information about what happened (number 4), developer-friendly explanation of the impact (number 5), and remediation plan (number 6).

All these pieces of information together create a very actionable insight. Developers can now see what happened behind the scenes. They can immediately recognize things that should be fixed before going to production, or even before sending the code for a review. If they want to understand more, Metis provides additional details tab (number 7). For instance, SQL query text (number 1) and table access details (number 2):

Query visualization:

Summary about interaction with each table:

Detailed metrics related to the database interaction:

Or the execution plan in JSON format:

All these details give the developer enough understanding of what happened and how to improve that.

We can wire this process into our CI/CD pipeline and get automated insights for the pull requests:

We can see two types of analysis: performance (just like the one we saw above) and the schema migration:

The analysis captures all the changes applied to the schema. If we open details, we see similar insights as for the performance tab:

Metis can analyze pull requests automatically. For instance, it can integrate with GitHub Actions and comment on the pull request directly:

This makes the CI/CD pipeline properly protected. Whenever there is a sub-optimal interaction detected, Metis will signal that, and you can configure your CI/CD pipeline to stop the changes from being deployed to production.

Summary

The world has changed significantly. We deploy constantly, run hundreds of microservices, and maintain multiple databases of various kinds. Our current testing solutions are not enough. We can't wait for load tests to find the scalability issues. We can use OpenTelemetry and CNCF standards to extract details of database interactions (execution plans, queries, configuration, etc.), and then apply some reasoning on top of them.

This way, we can build proper database guardrails in our CI/CD pipelines and get feedback much faster to prevent bad code from going to production. We can now connect all the dots to monitor and automatically troubleshoot databases.

You Must Test Your Databases And Here Is How