Why choosing a database is important
The database solution will affect nearly every part of the application. Different libraries will be used for data modeling, data models may depend on the storage shape, input and output validation will change, and even data presentation will be largely dictated by the underlying schema.
To understand all the implications of choosing the database, let’s run through various areas of the application. We’ll list what should be considered, and how different requirements may affect our approach.
Ease of use
The very first aspect to consider is the ease of use of the database. SQL databases are generally easier for newcomers because they are well standardized, are typically covered in university courses, and are common in the industry, so nearly every developer has worked with them at some point in their career. There are tons of helpful materials on the internet: books explaining implementation details, conference talks, blogs, and online courses.
Tooling is also important: how easy it is to connect to the database from an application written in the language of our choice, how hard it is to add observability, or how long it takes to install and configure the database. We need to verify whether the installation can be automated, whether the database can be deployed from CLI tools, and whether it can be set up with Infrastructure as Code (IaC).
Another aspect is how hard it is to model the domain in a given database and in our programming language. There is the well-known object-relational impedance mismatch: not all SQL concepts can be represented in a typical object-oriented language, and vice versa. If we go with functional programming, the issue takes a different technical shape due to data immutability. We also need to think about how often our schema changes, how structured the data is, and whether normalization is going to cause any issues on our side.
Finally, the variety of tools should be considered. Is there only one UI application capable of connecting to the database and presenting its internals, or are there plenty of off-the-shelf solutions? Building our own tooling takes time and may not be cost-effective, especially during the prototyping phase.
Scalability
Yet another aspect is scalability. There are two dimensions to consider.
First is the fact that our database will grow over time, especially after many years. It’s not uncommon to have tens or hundreds of tables and gigabytes or terabytes of data once our application matures. Will the database work well with millions of entities stored inside? Will it be able to load more, or will performance degrade badly? Even if the data is not actively used and just sits idle, it may consume precious resources and slow the system down. It is especially hard to predict how our business will evolve over the years, so generally we should aim for a database that doesn’t impose limitations. Obviously, every system has its limits, and there may be a natural upper bound for any data we store (for instance, we probably won’t have more than 9 billion living customers registered in 2023), but there should be no inherent architectural component preventing scaling.
Another aspect is performance in a given timeframe. We should think about peaks during important business times: think Christmas, Halloween, weekends, Monday mornings, and so on. Our application will need to handle much bigger traffic, many more transactions per second, and more processes running concurrently and in parallel. Will our database scale up as our business gets bigger and bigger? How many trade-offs will we need to make to keep performance from degrading?
Scalability depends heavily on the type of the application. We’ll have different requirements for OLTP applications that handle many short-lived transactions than for ETL or reporting solutions that query tons of data but can wait minutes or even hours for the result. Still, they all need to continue operating despite increased storage size.
Obviously, we want the fastest database possible. However, there is no free lunch: we pay for performance either with cash or with changes to our applications. We should plan in advance and not get surprised by a significant increase in traffic or data load.
Consistency
Another area is how changes are propagated. The fact that our application “committed” the transaction and believes it’s completed doesn’t mean that this is the reality, especially when we scale the database globally.
The first thing to evaluate here is the consistency guarantee provided by the database: things like Read Your Writes, and which consistency models are supported in general. This may change once we cluster the database or configure partitioning, sharding, always-on replicas, and other scaling solutions.
Consistency is affected by our application as well. We may control the transaction isolation level inside our application, and this affects the visibility and consistency of the data. Reading data that is a couple of seconds old may be perfectly fine for some use cases and unacceptable for others; it depends on the business we operate in.
Also, pay attention to the defaults of the libraries we use: they configure the isolation level, session properties, and other parameters that affect consistency.
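The visibility rules are easy to observe even with an embedded database. Below is a small Python/SQLite sketch (the table and values are invented, and autocommit is configured explicitly instead of relying on driver defaults): the writing connection reads its own uncommitted insert, while a second connection only sees the row after the commit.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")

# isolation_level=None puts the driver in autocommit mode, so we control
# transaction boundaries explicitly instead of relying on library defaults.
writer = sqlite3.connect(path, isolation_level=None)
reader = sqlite3.connect(path, isolation_level=None)

writer.execute("CREATE TABLE accounts (id INTEGER, balance INTEGER)")

writer.execute("BEGIN")
writer.execute("INSERT INTO accounts VALUES (1, 100)")

# The writer observes its own uncommitted change ("read your writes")...
in_tx = writer.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
# ...while another connection still sees the old state.
outside = reader.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]

writer.execute("COMMIT")
after = reader.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]

print(in_tx, outside, after)  # 1 0 1
```

A distributed database adds replication lag on top of this, but the core question is the same: who sees which write, and when.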
Not to mention that some consistency levels may never be achievable at all. That’s not because the software isn’t mature enough yet; it’s because it is mathematically impossible, as the CAP theorem shows for distributed systems that must tolerate network partitions. This is just how the world works.
Atomicity and transactions
The next area to check is atomicity and transaction handling in the database. Can we modify multiple records at once? Can we modify only part of a record, or do we need to modify everything? Does it scale well with increased data size?
This is especially important when complex business logic is involved. For instance, we may need to implement Sagas and compensating transactions to keep our data consistent because we can’t run a distributed transaction. Another issue may be performance: the two-phase commit protocol performs really badly when run across the planet.
Generally, we assume that our changes are “atomic” in the sense that either all of them apply at once or none of them start propagating. This may not be technically possible. We then need to reevaluate our business needs and consider whether we can handle these cases on the business level. For instance, an e-shop may decide to accept some losses from invalid transactions rather than lose its customers’ trust.
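The compensating-transactions pattern mentioned above can be sketched in a few lines of Python. The order-flow steps here (`reserve_stock`, `charge_card`, and friends) are hypothetical stand-ins for real remote calls; a production saga would also need persistence and retries.

```python
class SagaFailed(Exception):
    pass

def run_saga(steps):
    """Run (action, compensation) pairs in order.

    If any action fails, previously completed steps are undone in
    reverse order -- the classic compensating-transactions pattern
    used when a single distributed transaction is not available.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
        except Exception as exc:
            for undo in reversed(completed):
                undo()
            raise SagaFailed(str(exc)) from exc
        completed.append(compensate)

log = []

def reserve_stock():
    log.append("stock reserved")

def release_stock():
    log.append("stock released")

def charge_card():
    raise RuntimeError("card declined")  # simulated failure

def refund_charge():
    log.append("charge refunded")

try:
    run_saga([(reserve_stock, release_stock), (charge_card, refund_charge)])
except SagaFailed:
    pass

print(log)  # ['stock reserved', 'stock released']
```

Note that the end state is only *eventually* consistent: between the failed charge and the compensation, the stock briefly looked reserved.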
Access patterns
Our applications must read the data. Reading queries mostly come down to scanning data sources, joining them together if needed, and filtering the results based on some criteria. The criteria we choose are the important part: we want our queries to finish as fast as possible, and some filtering is easier said than done. The two most important aspects are sargable expressions and filtering out as many entities as early as possible.
Sargable (Search ARGument ABLE) expressions are expressions that the database can optimize. For instance, if we look for an entity whose property P equals a value V, and the entities are sorted by P, then finding the row is a matter of binary search. This generally scales well and is fast: the engine can optimize the expression and run it much faster.
However, if we look for entities for which some function f(P) returns a given result, the query may be impossible to optimize, and we’ll need to execute the function for every single entity.
Let’s take an example. Imagine we look for entities whose creation date P equals V = January 1st, 2023. This is easy if all the entities are stored in order of creation: we can simply binary-search and find all the matching entities. However, a filter on f(P) = “year of P” may be impossible to optimize, because the engine may not realize that it can still use the order of the rows to speed up the search.
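The difference is easy to observe with SQLite’s `EXPLAIN QUERY PLAN` (a sketch; the table and index names are invented). A range predicate on the indexed column is sargable and yields an index search, while wrapping the column in `strftime` forces the engine to evaluate the function for every row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.execute("CREATE INDEX idx_created ON orders (created_at)")
conn.executemany(
    "INSERT INTO orders (created_at) VALUES (?)",
    [("2022-12-31",), ("2023-01-01",), ("2023-06-15",)],
)

# Sargable: a range predicate on the indexed column lets the engine
# binary-search the index (the plan says SEARCH).
sargable = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders "
    "WHERE created_at >= '2023-01-01' AND created_at < '2024-01-01'"
).fetchall()

# Non-sargable: wrapping the column in a function hides the ordering,
# so every row must be visited (the plan says SCAN).
non_sargable = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders "
    "WHERE strftime('%Y', created_at) = '2023'"
).fetchall()

print(sargable)      # plan contains SEARCH ... USING ... INDEX
print(non_sargable)  # plan contains SCAN
```

The same rewrite works in most SQL databases: express the condition as a range over the stored column instead of a function of it.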
The second aspect is filtering out as many entities as early as possible. It’s basically a matter of ordering our filtering criteria so that rows are eliminated sooner. Say we look for all people born in 1990 whose first name is John. Should we first filter by date of birth or by name? Are there more Johns in our dataset, or more people born in 1990? The order may matter a lot.
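A toy sketch of this effect (the names, years, and dataset are made up): with short-circuiting filters, the second predicate only runs on rows that survive the first one, so putting the rarer predicate first means far fewer evaluations overall.

```python
import random

random.seed(42)
NAMES = ["John", "Anna", "Mark", "Lucy"]
# 10,000 synthetic (name, birth year) rows.
people = [(random.choice(NAMES), random.randint(1950, 2010))
          for _ in range(10_000)]

def count_checks(first, second):
    """Apply two predicates in order, counting how many are evaluated."""
    checks = 0
    matches = []
    for person in people:
        checks += 1
        if first(person):
            checks += 1          # second predicate only runs on survivors
            if second(person):
                matches.append(person)
    return matches, checks

def is_john(p):
    return p[0] == "John"        # matches roughly 1 in 4 rows

def born_1990(p):
    return p[1] == 1990          # matches roughly 1 in 61 rows

by_name, name_first_checks = count_checks(is_john, born_1990)
by_year, year_first_checks = count_checks(born_1990, is_john)

assert by_name == by_year                   # same answer either way
assert year_first_checks < name_first_checks  # rarer predicate first: less work
print(name_first_checks, year_first_checks)
```

Query planners make exactly this choice using selectivity statistics; when the statistics are stale, they can pick the wrong order.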
Things get more complex when we join multiple data sources together. If we can guarantee that the data is sorted before joining, we can join the sources much faster: the database engine can use binary search or merge-join algorithms to improve performance.
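A sort-merge join of the kind mentioned above can be sketched in a few lines. The `users` and `orders` data are made-up examples, and a real engine streams rows from disk rather than joining in-memory lists, but the scanning logic is the same.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs, both sorted by key.

    Each side is scanned once, so the join runs in roughly O(n + m)
    instead of the O(n * m) of a naive nested-loop join.
    """
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, _ = right[j]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit the run of matching keys on the right side.
            k = j
            while k < len(right) and right[k][0] == lk:
                result.append((lk, lv, right[k][1]))
                k += 1
            i += 1
    return result

users = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
orders = [(1, "book"), (1, "pen"), (3, "mug")]
print(merge_join(users, orders))
# [(1, 'Alice', 'book'), (1, 'Alice', 'pen'), (3, 'Carol', 'mug')]
```

If the inputs are not already sorted, the engine must sort or hash them first, which is exactly the cost a good index can save.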
How can we improve read performance? By adding indexes. However, indexes are not free: while they speed up searches, they slow down insertions, deletions, and modifications, because the order of the entities must be maintained. This means we want to add as many indexes as needed, but no more.
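The trade-off can be sketched with a sorted list standing in for an index (a deliberately simplified model; real indexes are B-trees or similar, but the asymmetry is the same):

```python
import bisect

# The "table" keeps arrival order: appends are O(1).
# The "index" keeps sorted order: each insert pays an O(n) shift
# to preserve the ordering -- the write cost of having an index.
table = []
index = []

for key in [42, 7, 99, 3]:
    table.append(key)            # cheap write, unordered storage
    bisect.insort(index, key)    # extra write cost, ordered storage

assert table == [42, 7, 99, 3]
assert index == [3, 7, 42, 99]

# Reads are the mirror image: O(n) scan over the table,
# O(log n) binary search over the index.
assert 99 in table                                  # linear scan
assert index[bisect.bisect_left(index, 99)] == 99   # binary search
```

Every extra index repeats that per-write cost, which is why unused indexes are pure overhead.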
How do we know which indexes to add? This is where our search patterns come into play. Depending on what our application does, we may add or drop indexes, consider various filtering strategies, or join data sources differently. Some databases are heavily optimized for a specific type of filtering, like full-text search or filtering based on coordinates. If our access patterns focus mostly on these specific use cases, it may be a good idea to choose a database that supports them much better.
Interoperation with other applications
We may have multiple applications running against the same database. These applications may use different technologies, programming languages, drivers, may be built with different paradigms, or may send different queries depending on their characteristics (think OLTP vs ETL). Should we have one data source for all of them? Should we have different databases? How can we migrate data between these different databases? Are we going to run into challenges with ETL?
Using one database for all use cases may not be the best approach. However, sharing data between separate data stores may be hard to maintain in the long run. We should always think about possible issues caused by the lack of a single source of truth, or inconsistency due to slow synchronization. Not to mention that different tech stacks may handle international characters, high-precision numbers, or dates differently.
Another aspect of interoperability is schema migrations. Which application should be in charge of the schema? Who decides when the data source is modified? How does it happen, and where is the primary source of the schema? These things need to be carefully considered so that applications don’t break because they expect the old schema to still be in place.
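One common answer is to make the database itself record which schema version it holds, so every connecting application agrees on the current state. Below is a minimal, hypothetical migration runner in Python with SQLite (the table and column names are invented); real tools like Flyway or Alembic follow the same idea with more safeguards.

```python
import sqlite3

# Each migration runs exactly once, in order; the applied version is
# stored in the database itself.
MIGRATIONS = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)",
    "ALTER TABLE users ADD COLUMN email TEXT",
]

def migrate(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, ddl in enumerate(MIGRATIONS[current:], start=current + 1):
        conn.execute(ddl)
        conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    conn.commit()

conn = sqlite3.connect(":memory:")
migrate(conn)
migrate(conn)  # idempotent: already-applied migrations are skipped

cols = [c[1] for c in conn.execute("PRAGMA table_info(users)")]
print(cols)  # ['id', 'name', 'email']
```

The harder organizational question remains who is allowed to append to `MIGRATIONS`, and how old application versions tolerate the new schema.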
Longevity
Our solution must survive years. We shouldn’t pick something with an expiration date only a couple of months away; we need to plan ahead for years and decades. Things will change in that time: new technologies will emerge, some companies will go out of business, trends will shift, and new things will become fashionable.
We need to consider how this may affect our decision. First, some databases presenting a “new approach” may fall out of favor. Some new solutions may never become mature enough to be used in long-running projects. Using something “old and boring” may be much more reliable in the long run.
Another aspect is support from the community and companies. We may not get paid support a decade from now, specialists may disappear from the market, and there may be nobody on Stack Overflow who knows how to answer our questions. While we can grow our own specialists over the years, they may leave the company and take their precious tribal knowledge with them. Using something popular and well-supported in the industry will make finding experts far easier.
Finally, another case to consider is the hype, popularity, and fashion. Solutions that are well-known will just get much better support. This amounts to more books, blog posts, documentation, materials, workshops, training, courses (paid or free of charge), companies willing to share their expertise and provide support. Niche solutions will have much lower visibility, and will get much less attention in the industry.
Hosting externally or on-premise
We need to host the solution somewhere. We can either do it ourselves (on-premise), or use some external data-center taking care of that (cloud, dedicated solution provided by the owner, etc.).
Going on-premise has many benefits. We control the connectivity and the storage, so complying with GDPR and other privacy acts is much easier. We know where our data is; we can physically go there and get access. We can configure VPNs of our choice, or control access with solutions aligned with our security requirements. On the other hand, we need to maintain the infrastructure ourselves: install updates, fix bugs, back things up, and make sure everything is protected and secured.
Going with a cloud or dedicated solution is much easier to maintain. Things get proper support from the hosting provider, so we don’t need to think about infrastructure, updates, backups, power outages, and other operational concerns. If something fails, we typically have an SLA with the provider, so we can get compensation in case of system downtime. However, we may not be able to control the VPN solution, or where and how our data is stored. We may not be able to encrypt it in a way that guarantees nobody else has access to it. We may need to implement our own client-side encryption, or limit what we store so the data doesn’t leave the geographical region required by government regulations.
Price and licensing
No matter what database we choose, we need to pay for running the solution. This includes:
- License for the servers/clusters/applications,
- Ongoing costs of hosting/cloud/infrastructure,
- Salary for the crew maintaining the solution on our end,
- Overall cost of learning, supporting, and maintaining the pieces talking to the database.
Some of these aspects may be hard to evaluate, but let’s focus on the ones that we can easily check before starting the project.
The database may be free of charge (like open-source solutions), paid via subscription/licensing, or available only within some infrastructure (like AWS DynamoDB). We need to evaluate the potential cost of using the database in the following setups:
- Local development environment for each developer,
- Database per feature-branch to run tests,
- Load tests,
- Team’s development environment,
- QA, staging, pre-production environments,
- Production environment.
For each of these cases we need to consider the load, the traffic, and all the costs of booting up the solution. Due to licensing, we may not have feature parity between environments (e.g., a free edition locally but an enterprise edition in production). We also need to consider hardware-related licensing, as some databases require subscriptions per mainframe, per core, etc.
Next, we need to evaluate the ongoing costs of the infrastructure. How much are we going to pay on a daily basis for keeping the environment up and running? What if we need to increase the load, run load tests, performance tests, etc.?
Finally, we need to find the right specialists knowing how to use and operate the database. Obviously, they will need to be paid for doing their job, and the salary will depend on the technology we have in mind.
We covered some aspects which we need to consider when choosing the database. Let’s now see some actual solutions that we can evaluate to make our final decision.
SQL (Relational) databases
SQL databases provide a variety of features that have been developed and improved over many decades. Because of this history, scaling features were added much later, as demand for scalability grew, and so they evolved in multiple directions.
Examples of relational databases:
- MySQL / MariaDB
- Microsoft SQL Server
- Distributed SQL: YugaByte, CockroachDB
SQL databases support ACID, an acronym that stands for Atomicity, Consistency, Isolation, and Durability: a set of properties that ensure reliable processing of database transactions.
- Atomicity: This property ensures that a transaction is treated as a single, indivisible unit of work. If any part of a transaction fails, then the entire transaction is rolled back, leaving the database in the same state it was in before the transaction began.
- Consistency: This property ensures that the database remains in a consistent state before and after each transaction. This means that any changes made by a transaction must meet certain constraints defined by the database schema.
- Isolation: This property ensures that each transaction is isolated from other transactions, so that the operations performed by one transaction are not visible to other transactions until the first transaction has completed.
- Durability: This property ensures that once a transaction has been committed, it is permanently recorded in the database and will survive any subsequent system failures.
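Atomicity in particular is easy to demonstrate from Python with SQLite (a sketch with invented account data): when the second update violates a constraint, the whole transfer is rolled back, not just the failing statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 0)])
conn.commit()

try:
    # The with-block commits on success and rolls back on any exception.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance + 50 "
                     "WHERE name = 'alice'")
        # Bob has no money: the CHECK constraint rejects a negative balance.
        conn.execute("UPDATE accounts SET balance = balance - 50 "
                     "WHERE name = 'bob'")
except sqlite3.IntegrityError:
    pass

# Atomicity: the half-applied transfer was undone entirely.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}
```

Consistency is also visible here: the CHECK constraint is what keeps the database from ever storing an invalid state.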
Benefits of SQL databases:
- Structured data: SQL is designed for structured data, i.e. data organized in a specific way using a fixed schema, which can be easily searched and analyzed.
- Maturity: SQL has been in use for over four decades, and over that time it has been refined and improved to meet the needs of a wide range of applications. As a result, SQL has a well-established set of standards and best practices, and there is a large and active community of developers and users who can provide support and guidance. Additionally, many third-party tools and libraries have been developed to work with SQL databases, making it easier to integrate SQL into a wide range of applications.
- Scalability: SQL databases can be scaled out using a shared-nothing architecture, which distributes data and processing across multiple nodes, each with its own memory, storage, and processing power. This enables horizontal scalability, as nodes can be added or removed to accommodate changes in demand.
Drawbacks of SQL databases:
- Schema migration: Schema migration refers to making changes to the structure of a database, such as adding or modifying tables, columns, or indexes. While schema migration may be necessary to support changes in an application, it can also be problematic, especially in large and complex databases. One of the main issues is that it can be difficult to ensure the changes are applied correctly and consistently across all instances of the database, which can lead to errors, data inconsistencies, and even downtime if the migration fails.
- Analytics: While SQL databases are well suited for transactional processing and data management, they are not always the best choice for analytics and may struggle with complex analytical queries that require aggregations, filtering, and sorting of large amounts of data. For analytics, consider OLAP (Online Analytical Processing) solutions, which are designed specifically for these types of queries and provide features such as multidimensional data modeling, advanced analytics, and fast query processing.
- Inflexibility: Entities must conform to the schema. While SQL supports non-structured columns (like JSONB), it doesn’t handle a variety of entities well. This is especially important for storing data that we don’t fully control, or the data that can contain dynamic properties that can be added and removed on the fly.
- Scalability: SQL databases try to ensure ACID properties as much as possible, which leads to the use of the two-phase commit protocol or other synchronization approaches. While these provide great guarantees regarding data quality, they typically make transactions run and scale slower.
NoSQL databases
NoSQL databases try to address the inefficiencies of SQL databases in terms of scalability and flexibility of schema control. They are typically built on a case-by-case basis to address a very specific set of requirements and features.
Benefits of NoSQL databases:
- Scalability: NoSQL databases are designed to scale horizontally across many servers, which can make them easier to scale as the amount of data grows.
- Flexibility: NoSQL databases can be more flexible than relational databases when it comes to handling unstructured or semi-structured data, because they don't rely on a fixed schema.
- Performance: NoSQL databases can be faster than relational databases for certain types of queries, especially if the data is distributed across many servers.
Drawbacks of NoSQL databases:
- Consistency and durability: NoSQL databases are often designed to prioritize scalability and performance over consistency and durability. This means that in some cases, updates to the database may not be immediately consistent across all nodes, and data loss may be more likely in the event of a hardware failure.
- Learning curve: Because NoSQL databases often handle data differently than relational databases, there can be a learning curve involved in using them.
- Limited querying and reporting capabilities: NoSQL databases are designed to handle large volumes of unstructured or semi-structured data, but they are generally not as well-suited for complex querying and reporting. This can make it more difficult to analyze or extract insights from the data.
- Less mature technology: While NoSQL databases have been around for some time, they are still a relatively new technology compared to relational databases. As a result, there may be fewer tools and technologies available for managing and troubleshooting NoSQL databases, and fewer experienced developers available to work with them.
- Data modeling complexity: Because NoSQL databases are designed to be schema-less or schema-light, it can be more difficult to model data in a way that ensures data consistency and integrity. This can require more careful planning and design upfront, and may require more ongoing maintenance over time.
- Limited community support: While there is a growing community of developers and users working with NoSQL databases, the community is still smaller than the community of developers and users working with relational databases. This can make it more difficult to find help and resources when encountering problems or challenges.
Graph databases
Graph databases store data as nodes and edges between them. They represent the notion of connections between entities, and primarily focus on queries that explore entities connected to each other directly or indirectly.
They are very flexible in terms of schema, as each attribute can be represented as an edge to an entity of a different type. Each node may have a different set of connections, so you can represent any schema this way.
Interestingly enough, graph databases were one of the first databases used in computer science. They were known as the Network model in the 1970s.
Benefits of graph databases:
- They can represent any relationships and typically provide fast join operations
- They can represent any attributes
- They scale well with the number of queries
- They support a wide range of algorithms for problems like path finding, similarity, classification, and distance calculations
- They may be much easier to query for social analysis, as our usual understanding of relationships transfers well to graphs
Drawbacks of graph databases:
- It may be hard to represent transactions, as it’s challenging to modify multiple entities at once. This can be mitigated by introducing an additional level of indirection and grouping entities into change sets, but that may be inflexible
- There is no standardized query language for graph databases
- Schema flexibility may be a challenge due to the lack of a clear picture of what the typical attributes are
- They may be too slow for queries spanning the entire database
- They may be too slow to process a high volume of transactions
- Some graph databases don’t scale beyond a single node
- They may not be optimized for warehousing queries
- Query cost may not scale with the amount of data we extract; it may depend far more on the size of the graph we traverse
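The traversal-centric access pattern described above (path finding, exploring connections) can be sketched with a plain adjacency list and breadth-first search; the `follows` graph and the node names are invented for illustration, while real graph databases add indexing, persistence, and declarative query languages on top of this idea.

```python
from collections import deque

# A toy social graph stored as adjacency lists: node -> outgoing edges.
follows = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": [],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: the kind of traversal graph databases optimize."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable

print(shortest_path(follows, "alice", "dave"))  # ['alice', 'bob', 'dave']
```

Note how the cost depends on the size of the traversed neighborhood, not on the total database size, which is exactly the strength (and the scaling caveat) listed above.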
Object-oriented databases
Object-oriented databases store objects from the object-oriented paradigm. They aim to ease the impedance mismatch between the SQL (relational) approach and the OOP world.
They may work well if we deal with a complex domain model (think Domain-Driven Design) that needs to be closely resembled in our application. However, they may be much harder to use if we don’t follow the object-oriented paradigm, so sharing the database between programming languages may be significantly harder.
Benefits of object-oriented databases:
- The impedance mismatch is heavily minimized
- There are no joins between relations, as objects are stored “natively”
- The schema is flexible and closely resembles the “true” domain model
Drawbacks of object-oriented databases:
- Querying is typically slower than in SQL
- There is no single standardized query language
- Transaction support may be limited due to the way objects are stored
- Using the database from a non-OOP language may be much harder
- Warehousing queries may be much slower
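The “native” storage idea can be sketched with Python’s built-in object serialization (a simplification of what object databases do; the classes are invented for illustration): the whole object graph is stored and restored as-is, with no mapping to tables and no joins.

```python
import pickle

class Order:
    def __init__(self, items):
        self.items = items

class Customer:
    def __init__(self, name, orders):
        self.name = name
        self.orders = orders  # nested objects, no foreign keys

alice = Customer("Alice", [Order(["book", "pen"])])

blob = pickle.dumps(alice)     # persist the whole object graph
restored = pickle.loads(blob)  # load it back, references intact

print(restored.name, restored.orders[0].items)  # Alice ['book', 'pen']
```

The drawbacks follow directly: querying inside opaque blobs is slow, and a non-OOP (or even differently versioned) consumer struggles to read them.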
Cookbook for choosing a database
Let’s now wrap up with a checklist of how to make our final decision.
- Are there any government regulations, privacy requirements, or internal guidance that we need to strictly follow?
- What hosting solutions can we consider, and which solutions are definitely out of scope?
- How much can we pay for the database in the next week? Month? Year? Decade?
- What applications will need to talk to the database?
- What scale do we need to support? How many transactions per second do we need?
- How do we operate our business in face of inconsistencies or delayed propagation?
- What is our primary access pattern when operating with the database? How do we deal with the data?
- Do we have specialists at hand for the given technology? Do we think it’s going to be a problem in the next decade to find people with a working knowledge of the database?
- How are we going to host the database?
- How easy is it to use and operate the database?
- Finally, how much do we like the solution?
Once we answer all these questions based on the remarks provided earlier, we can make our choice consciously and explicitly.
Choosing a database for the next project is not an easy task. However, by keeping the checklist in mind, evaluating the aspects mentioned in this post, and understanding our environment, we can make the choice confidently and give ourselves pretty good odds of succeeding in the future.