Auditing - what’s that
A long standing question in computer science is whether we should ever remove the data. This applies to not only deletion per se, but also to data modification (think of UPDATE query), data deprecation (think about migrating data between the data centers), or even data ingestion (when parsing data or via an ETL system).
Why wouldn’t we want to delete the data? There are multiple reasons:
- Business data is the actual value that the company has. Deleting it seems like getting rid of the market advantage
- We may need to keep the data for legal reasons - regulations, ongoing agreements, history of what happened
- Data may be deleted by accident, so it’s better to keep it around to be able to recover it easily
- A malicious agent may delete the data, and we should be able to recover them, or even prevent the deletion from happening
However, deleting the data may be easier from the technical standpoint:
- We always need a way to mark some entities as deleted even if we don’t delete them physically. We need a flag or marker indicating whether a given entity is deleted or not. This makes database interaction much harder because we need to include additional filters like is_deleted = false in every single query we run
- Storing “deleted” data along with the non-deleted one makes the data source bigger, especially over the years. This may lead to slower queries or more complex partitioning schemes.
- It makes maintaining uniqueness harder. We can’t make identifiers unique because when we want to modify the row then the new row should have the same identifier. We need to include entity version in the uniqueness check
In short, it is not trivial to not delete the data, and this may lead to multiple technical challenges. However, there are good reasons to not remove any entity. How can we deal with it? The solution is to keep a log of all of the changes in our data source. This is called auditing.
In the rest of this post we’re going to cover some auditing solutions, and then analyze how they perform.
How to implement auditing
So we decided to implement auditing. Before going into actual solutions, let’s first see what requirements we would like to meet.
First, this should be fast. We don’t want to unnecessarily slow down the production server. The solution should be efficient enough to not break the production metrics to an unacceptable level.
The solution should support all possible changes. It should work with adding new rows, deleting existing ones, modifying values, and should support all column types. In the case of modifications, it should also be able to store not only the new values, but also have access to the old values, so we can easily compare the difference.
The solution shouldn’t interfere with existing processes. If there are long-running queries or operations, then auditing shouldn’t stop them from running or cause excessive delays.
The solution should work generally in the same way as “regular” data source. Ideally, we would like to use the same mechanisms as in a regular production usage to get the data from the audit log.
We would like to control what is being logged and how. Maybe we want to recreate this log from some other data, maybe we want to extend it with additional fields, maybe we want to merge multiple data sources into one. Ideally, all these scenarios should be supported.
The log should be readable and easy to query. It should provide enough details to understand the context, but shouldn’t be overloaded at the same time.
Finally, it should be easy to configure and understand. Ease of usage is always a nice-to-have feature.
The simplest way in terms of configuration is just enabling extensive logging in the SQL database. This allows to store all the queries in the query log which can be examined later. Every SQL engine should support that. For instance, PostgreSQL can be configured with log_statement = all.
The challenge here is the readability of the log. Since this logs statement, it may be hard to see what really happened without understanding the code and the context. For instance, the following command:
will create a table named important_table, but we won’t see this name directly in the log. We will need to search all statements and understand their real meaning.
Similarly, changes to the data may be captured with statements, so it may be much harder to reason about rows and their values.
Depending on which SQL engine we use, there may be extensions to this mechanism that would make the log more readable. For instance, pgAudit for PostgreSQL can extract the statement in some standardized form that will make searching the log much easier.
- Logging is a configuration feature and typically doesn’t require any code changes
- This is directly supported by the database engine
- Raw logs may be hard to search and understand
- Reasoning about entities may be much harder because of statement-level logging
- Logging may result in very big log files being stored
First solution we’re going to see are triggers.Trigger is a procedural code that gets executed as a response to some database event. This could be data insertion, deletion, or modification. We can run the code and do whatever we want. Typically, we want to get the new and old values, and insert them to some other table to keep the log of the changes.
Let’s take the following example for PostgreSQL database:
We first create the function that will be called for each row modification. We can see that we run INSERT in the function. INSERT statement inserts new row into the audit table and accesses both old and new values.
We then create the trigger that runs after each update on the accounts table. It will run when the UPDATE is completed, and will be executed for each row separately.
There are multiple configuration options for triggers:
- They can execute before, after, or instead of the operation
- They can be executed once per row or per statement
- They can operate with tables and views
- They can execute any function, so we’re free to do whatever we want
- They can provide conditions to be checked before running the trigger
- Triggers are highly configurable
- They can do whatever we need, so we are free to implement audit tables of whatever shape or content
- They are typically supported natively by the SQL database
- They may be slow
- They may be hard to maintain because they are written in the SQL language
- We don’t see that they are executed just by looking at the table
- The order of the triggers may not be clear when there are multiple of them in place
- Triggers require writing custom code and maintaining the audit tables
Query rewrite rules
Some database engines support query rewrite rules. When a query is sent to be executed, the engine rewrites it based on some logic and modifies how the query works. The engine can also generate new queries to be executed alongside. This allows us to do some additional work at the time of query execution.
Let’s see this example from PostgreSQL:
This rule specifies that whenever there is an update to the accounts table, then we want to also insert into the audit_log table. We can configure multiple things with rules:
- Rules can have multiple actions
- They can run alongside the query or instead of it
- Rules can have some conditions when they run
However, rules can be hard to understand when we run functions in updates. A single function may be executed multiple times. This may be a problem if the function returns different values each time.
When it comes to the performance, rules are typically faster than the triggers. Rules create new query trees that are executed next to the original statement (or instead), and so don’t require specific code execution with each row.
- Rules can do whatever change we need since we can run custom code
- They are typically supported by the engine itself or by an extension
- Rules are considered hard to reason because of their “unpredictable” behavior when used with functions and procedures
- They may not be supported in the future. For instance, PostgreSQL considered dropping the feature entirely
- Rules require writing custom code and maintaining the audit tables
SQL standard now supports a mechanism called Temporal database. It is used to represent the following details about the world:
- There are facts like who is the president of the USA. However, facts may be unknown or an untrue facts may be spread around the world
- Facts may come from decisions. For instance, shorty after the elections we know who will become the president, but the new president will be sworn in much later in the future
- Computer systems may learn about the decision with a delay. For instance, the elections committee finished counting ballots, but hasn’t yet entered the data into the system for some reason
Depending which dimension we would like to use, we’ll get different answers for a question “who is the president of the USA”. What’s more, if we discover that we were lied to about some facts, then we’ll need to update the database. The fact hasn’t changed, however, our database was wrong for some time and had to be corrected.
Temporal databases are useful for various purposes, however, we can limit ourselves just to use them for implementing the auditing. We change the regular table into the temporal table, and then every change to the table is logged separately.
Let’s take the following example from Microsoft SQL Server:
This creates a table with two “business” columns (id and value), and some additional columns for maintaining the history. Since ValidFrom and ValidTo are marked as hidden, the following query:
will return id and value only. It will not return the additional columns. We can always ask for them explicitly with
And another important part is that we can query the history with this query:
Temporal tables can be supported either natively or via other mechanisms (like triggers).
- Standardized feature available out-of-the-box in various SQL engines
- May be supported directly in the database
- Historical table is yet another table that can be queried “the regular way”
- Temporal tables require one-time configuration and little-to-no maintenance of the audit tables
- We can’t make the history manually easily (it may be impossible to add rows in the past)
- We can’t run any code we want on the data modification (so we can’t tune logging)
- Once enabled for a table, it works all the time. We can’t control the audit on a case-by-case basis
Auditing in the application code
Last but not least, we can implement audits in our application code. We need to extend the business transaction with additional INSERT statements adding changes to the log tables. This gives us flexibility in terms of what we want to log and how.
Since this is an application code, we can store the audit data outside of the main SQL database. However, this may require a separate transaction, so we should be careful during the implementation.
The biggest challenge is with making sure that we log changes in case of bugs. If there is a bug in the auditing logic, then it shouldn’t affect the business part of the transaction and vice versa. However, since we would like the auditing code to be executed in the same transaction, then this may be hard to achieve and require nested transactions.
Another important aspect is making this solution in a generic manner. Enabling auditing and maintaining the solution may be time-consuming.
- This is database agnostic and can be migrated to different data sources
- Audit can span multiple SQL engines or even heterogeneous data storages
- We can run any code we want
- This is an error-prone solution. A bug in the auditing code may go unnoticed for a long time
- Audit entries should be created in the same transaction. However, inserting audit entries may cause the transaction to fail impacting the correct business code
- Invalid business entries may cause an audit to fail. If we keep logs in the SQL database, then we may lose them because the business part of the transaction failed
- We need to write our custom code and maintain the solution
- We need to modify each application
We have seen multiple mechanisms to implement auditing in the SQL database. Let’s now consider dimensions used to evaluate them.
Any solution is a cost. We need to understand it, maintain it, and even know that it’s there.
When it comes to configuring the solution, the logging approach is probably the simplest. We just enable it once for the whole database engine and it’s there. It is very likely to work well and not fail spuriously.
Similarly, temporal tables are easy to configure and we need to do it once per table. However, we may forget about that for new tables that we will create in the future, but this can be prevented with a static analysis.
Triggers and rules require more attention. We need to implement audit tables manually, maintain them with schema changes, and test the code for various scenarios. It’s hard to make them generic without losing the readability. Triggers and rules may also be hard to find for the developer. We may not even know they’re running behind the scenes which may lead to numerous issues.
Application code may be easier to implement, but it requires much better understanding of the moving pieces. However, this code may be easier to spot for the developer, and may be much more flexible in terms of what we do and how.
Another area is reasoning about the mechanism. If there are two triggers configured on the table - in which order will they be executed? If triggers call functions - are these functions going to be executed once for each row or once per transaction? What happens if the trigger fails? Answering these questions may be challenging, especially when we migrate our database from one engine to another.
When it comes to the performance, then triggers are generally considered slow. Especially when they’re executed for each row changed in the database. However, application code may be even slower if we execute new transactions for each row or send multiple queries over the wire.
Another performance aspect is cascading the solutions. If one trigger causes another trigger to fire, then the whole system may slow down significantly. Also, in case of long running triggers - are they going to be executed independently in each replica, or will they execute just once and the results will be sent over? Depending on our configuration, the performance may be severely affected.
Ease of use
The ultimate aspect we need to consider is whether the mechanism we use works well and provides the data we need. If we can’t “just browse” the audit logs, then it may be much less helpful when issues arise. Especially that we often use audit logs when we spot an error and want to track it down. These cases require us to work fast to minimize the impact, and so we want the audit logs to be available at hand.
The easiest way is to use temporal tables. If it is supported by our SQL engine and we don’t need to insert historical data, then temporal tables are probably the best way.
Otherwise, triggers are the standard way of implementing the audits. They have their drawbacks and issues, but they are supported directly by the database and will work in all of the scenarios. If we don’t want to use triggers for some reason, then query rewriting is an option as well.
If we can’t use triggers, then it’s probably best to implement the audit logs in the application code. This approach will take the most work and is error-prone, but also has some advantages in terms of migrating the solution to a different SQL engine.
Auditing is an important task for every database. We should think about it early in the development stage, and choose the solution that will work best for our needs. Temporal tables can be a low-hanging fruit that doesn’t take much time to implement and maintain. No matter which solution we go with, it’s crucial to test it well and make sure that it will provide all the data we need when the time comes.