Metis can analyze queries, suggest improvements, find performance issues, and automatically troubleshoot problems. We are going to use IMDb datasets. We are going to play with some queries, and see how Metis can identify performance characteristics.
You can find the datasets at IMDb datasets.
Information courtesy of
Used with permission.
Let’s start with preparing the database. We are going to use the PostgreSQL (Postgres) database.
Let’s start with downloading files:
Next, we need to decompress them:
Next, we need to replace \N characters with empty values in order to import them as NULLs:
Now, we need to create a database in our PostgreSQL instance. You can install the server locally or use the Docker image. Next, create the schema:
And now create tables. We are going to use TEXT columns to keep things simple:
Finally, we need to load the IMDb database:
The database is now ready and we can remove the input files:
Registering in Metis
You can now go to Metis and register. You are now ready to query the database. Start with a very simple query:
This should give the following output:
Now, we can get the execution plan for the query above. Let’s change the query to:
We should get something along the lines:
We can see how the SQL engine is going to execute the query. This is only the estimated plan, and the actual execution may differ. To get the actual plan, we can run this:
You can now copy the query and the execution plan to Metis Query Analyzer and get the following:
Let’s now see what environment I used for the experiments.
I used the following environment to run the tests below.
RDS instance db.m6g.large with PostgreSQL 13.7. It has 2 vCPUs and 8 GB of RAM.
EC2 instance t3.xlarge with AMI amzn2-ami-kernel-5.10-hvm-2.0.20221210.1-x86_64-gp2 with Amazon Linux 2 in version 5.10.157-139.675.amzn2.x86_64. It has 4 vCPUs and 16 GB of RAM memory.
Mac Mini with macOS 13 Ventura running on M1 with Darwin Kernel Version 22.4.0 It has 3.2GHz 8-Core M1 and 16GB of RAM.
Both EC2 and Mac used PostgreSQL 15.2 (Debian 15.2-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit running in a Docker container that you can get at public.ecr.aws/o2c0x5x8/metis-demo-db:latest.
For a given actor, find their latest movies
First query I used shows some latest movies a given actor has been in. This is the query:
And here is the output:
Let’s start with the estimated plan:
We can see that the SQL engine predicts using 3 tables, reading ~52M rows, and returning 10 of them. Let’s compare that with the actual:
We can see that the estimations were pretty accurate. The actual number of rows was closer to 56M, and the query returned 2 rows instead of 10.
However, based on the insights above, we can easily tell how to improve the performance. imdb.title_principals doesn’t have an index. If we go to the Query Tale tab, we can see that it scans the table:
How can we improve the query? Metis clearly indicates that a table is scanned and this decreases the performance! Let’s add the index:
The index is on the nconst column (which is used in the join operator), and it also includes the tconst column so that we can join it later on with another table. Now, the query finishes in less than a second:
The index clearly improved the performance.
For a given actor, find their ten most highly rated films
Let’s now find ten best movies a given actor has been in. The query is:
Here is the output:
This query runs for the following number of seconds:
Here is the estimated execution:
We can see a very similar story here. title_principals table doesn’t have an index, so the engine needs to scan it fully. Let’s compare with the actual execution:
We can see that the performance is low because of this missing index. Once we add it, we get the following performance:
The query executes in 1 millisecond now.
Find the ten top rated films with some number of votes
Let’s now find the best movies that received some number of votes. This is the query:
Let us now check the plan:
Once again, we lack an index. Metis clearly shows what index we should add:
This creates the index on the title_ratings table on the numvotes column. Let’s see if the actual execution agrees:
We can see that the number of rows read is actually higher than the estimate (1.31M vs 1.06M). The index would definitely help. Let’s add it and see the performance:
We can see that it greatly improved the runtime.
Given two people, list what movies they appeared in together
Let’s now write a query that finds movies with two specified actors. Query is:
Let’s now see the estimated execution plan:
Once again we see that the title_principals table is slowing the query down. Let’s compare that with the actual execution:
Estimations were pretty correct, and the query is slow. Metis shows that the table scan is the root cause of the low performance. Let’s add the index with:
Let’s see the performance now:
And we can see the performance is much better now.
List all of the cast and crew in a given movie
Let’s now find all the people involved in the movie. Here comes the query:
Here is the output:
And the timings:
Here are the estimates:
We can see three insights. One regarding number of rows (and missing index), one about the total cost of the plan, and one about the sorting of rows. Let’s compare that with the actuals:
We can see one more insight. That is regarding the I/O operations for the sorting part. When we go to the query tale, Metis shows that indeed scanning the name_basics table took a lot:
The optimizer decided to materialize the result of the scan because it couldn’t optimize the join with multiple filters.
Why do we need that join? This is because in directors and writers we store the value as CSV:
Not to mention, that this query extracted the title_basics table which we don’t need at all. How can we improve the query? We need to split the CSV into a regular column. First, let’s see how many identifiers we store at most:
We take the length of the writers column and subtract the length of the writers column with commas removed. The maximum value we get is 1390. This means that we have at most 1391 identifiers serialized in that field. For directors we get 490. Based on that we can safely assume that there are no more than 1500 identifiers. We can use that to split them and build a table with nconst identifiers:
We can now join this with name_basics table and get the final query:
Execution plan and analysis:
We can see that now the query uses indexes and is much faster. We could optimize it even further by calculating data that would let us avoid splitting by comma with each request.
Find the most prolific actor in a given period
Let’s now find the actor that did the most movies in a given period of time. Query:
Let’s see the estimated analysis:
Metis shows table scans, a lot. That’s because we filter based on the startyear column which isn’t indexed. Let’s fix that with the following indexes:
Let’s also notice that we don’t need to join name_basics first. We can start from the titles, and then get the details of the actor once we know who we’re looking for:
This gives the following analysis:
We can see how these changes improved the query significantly.
Find most prolific actors in a given genre
Let’s now find ten actors doing most movies in a given genre. The query goes like this:
Metis tells the following:
Metis immediately identifies a lack of indexes and operations that can be optimized. There are two things that we can do here. First, let’s add an index to find titles for a given genre faster:
Next, let’s rewrite the query to get the titles first and then join actors:
This gives the following analysis:
Specifically, we can see that we use indexes for each table and reach much fewer rows:
Finding most common coworkers
Let’s now find five people a given person worked the most with. We start with the following query:
Metis shows table scans and lack of indexes. See the numbers that the query reads millions of rows. Let’s build indexes for title_crew table:
Let’s also add indexes for title_principals:
This should give us a significant speedup:
We can see that we read thousands of rows now. That’s a great improvement.
We have seen a couple of good examples where Metis can clearly show database performance improvements. Thanks to that, we don’t go blind anymore. We can see our improvements, verify if indexes are beneficial, and see how the database performs. The crucial part is that we can finally get some clarity around the database internals. That’s the very first step towards building proper database guardrails.