How Metis Optimized Queries Executed by Sequelize - Part 2

Sequelize is an ORM for Node.js and TypeScript. It can connect to PostgreSQL and other SQL engines. It supports transactions, relations, eager loading, lazy loading, functions, and other common operations. Previous part covered some of the queries from Part 1 of this series. This time we are going to implement more sophisticated scenarios to see how Sequelize can deal with them. Let’s go.

Given two people, list what movies they appeared in together

Let’s find common movies for two people. That’s the application code doing that:

return titleBasic
.findAll({
	include: [
	{
		model: titlePrincipal,
		required: true,
		duplicating: false,
		as: 'titleBasicTitlePrincipal',
		where: {
			'nconst': actor1
		}
	},
	{
		model: titlePrincipal,
		required: true,
		duplicating: false,
		as: 'titleBasicTitlePrincipal2',
		where: {
			'nconst': actor2
		}
	},
]});

The code generates the following query:

SELECT
	TitleBasic.tconst,
	TitleBasic.titletype,
	TitleBasic.primarytitle,
	TitleBasic.originaltitle,
	TitleBasic.isadult,
	TitleBasic.startyear,
	TitleBasic.endyear,
	TitleBasic.runtimeminutes,
	TitleBasic.genres,
	titleBasicTitlePrincipal.tconst AS titleBasicTitlePrincipal.tconst,
	titleBasicTitlePrincipal.ordering AS titleBasicTitlePrincipal.ordering,
	titleBasicTitlePrincipal.nconst AS titleBasicTitlePrincipal.nconst,
	titleBasicTitlePrincipal.category AS titleBasicTitlePrincipal.category,
	titleBasicTitlePrincipal.job AS titleBasicTitlePrincipal.job,
	titleBasicTitlePrincipal.characters AS titleBasicTitlePrincipal.characters,
	titleBasicTitlePrincipal2.tconst AS titleBasicTitlePrincipal2.tconst,
	titleBasicTitlePrincipal2.ordering AS titleBasicTitlePrincipal2.ordering,
	titleBasicTitlePrincipal2.nconst AS titleBasicTitlePrincipal2.nconst,
	titleBasicTitlePrincipal2.category AS titleBasicTitlePrincipal2.category,
	titleBasicTitlePrincipal2.job AS titleBasicTitlePrincipal2.job,
	titleBasicTitlePrincipal2.characters AS titleBasicTitlePrincipal2.characters
FROM imdb.title_basics AS TitleBasic
INNER JOIN imdb.title_principals AS titleBasicTitlePrincipal ON TitleBasic.tconst = titleBasicTitlePrincipal.tconst AND titleBasicTitlePrincipal.nconst = 'nm0302368'
INNER JOIN imdb.title_principals AS titleBasicTitlePrincipal2 ON TitleBasic.tconst = titleBasicTitlePrincipal2.tconst AND titleBasicTitlePrincipal2.nconst = 'nm0001908';

Metis indicates a missing index:

‍

Database performance Insights on Metis - indicating a missing index

‍

The query takes 5 seconds to finish. We can now improve the query performance by adding an index for title_principals. This is the code to do that:

CREATE INDEX IF NOT EXISTS title_principals_nconst_idx ON imdb.title_principals(nconst) INCLUDE (tconst)

Notice that we had to configure two associations in the model to make this work. This is very misleading and should be avoided if possible. What if we don’t like the second association? We can implement joins in the application code, like this:

const first = titlePrincipal
.findAll({
	attributes: ['tconst'],
	where: {
		'nconst': actor1
	}
}).then(titles => titles.map(t => t.tconst));


const second = titlePrincipal
.findAll({
	attributes: ['tconst'],
	where: {
		'nconst': actor2
	}
}).then(titles => titles.map(t => t.tconst));


return first.then(firstTitles => second.then(secondTitles => {
	return titleBasic
	.findAll({
		where: {
			tconst: {
				[Op.and]: [
				{
					[Op.in]: [...firstTitles]
				},
				{
					[Op.in]: [...secondTitles]
				}
				]
			}
		}
	});
}));

We first load titles for the first person, then load titles for the second person. Finally, we process the lists together, and we send another query that would use both of them. This is the interaction with the database:

First query:

SELECT tconst
FROM imdb.title_principals AS TitlePrincipal
WHERE TitlePrincipal.nconst = 'nm0302368';

Second query:

SELECT tconst
FROM imdb.title_principals AS TitlePrincipal
WHERE TitlePrincipal.nconst = 'nm0001908';

These two queries are rather straightforward. We could then intersect the datasets in the application code, or we can ask the database to do so. No matter what we do, we end up with a query similar to this one:

SELECT
	tconst,
	titletype,
	primarytitle,
	originaltitle,
	isadult,
	startyear,
	endyear,
	runtimeminutes,
	genres
FROM
	imdb.title_basics AS TitleBasic
WHERE
(
	TitleBasic.tconst IN (
		'tt0000439','tt0014222','tt0021457','tt0029541','tt0447157','tt0020221','tt0020275','tt0460244','tt0829197','tt0008572','tt0009046','tt0040991','tt0155701'
	)
	AND TitleBasic.tconst IN (
		'tt0000439','tt0000452','tt0000501','tt0000528','tt0000578',...
	)
);

This scenario is now slower. Processing results in application takes 9 seconds to complete, so it’s slower than plain SQL. We can’t tell that it’s always the case, as sometimes it’s faster to process data in the application, especially when we don’t extract many records from the SQL. Your mileage may vary, so always measure your solution.

List all of the cast and crew in a given movie

Let’s now find all the people involved in the movie. Here is the first solution. We can just send the raw query:

return sequelize.query(`
	SELECT DISTINCT NB.*
	FROM imdb.title_basics AS TB
	LEFT JOIN imdb.title_principals AS TP ON TP.tconst = TB.tconst
	LEFT JOIN imdb.title_crew AS TC ON TC.tconst = TB.tconst
	LEFT JOIN imdb.name_basics AS NB ON
		NB.nconst = TP.nconst
		OR TC.directors = NB.nconst
		OR TC.directors LIKE NB.nconst || ',%'::text
		OR TC.directors LIKE '%,'::text || NB.nconst || ',%'::text
		OR TC.directors LIKE '%,'::text || NB.nconst
		OR TC.writers = NB.nconst
		OR TC.writers LIKE NB.nconst || ',%'::text
		OR TC.writers LIKE '%,'::text || NB.nconst || ',%'::text
		OR TC.writers LIKE '%,'::text || NB.nconst
	WHERE TB.tconst = :tconst
`, {
	model: nameBasic,
	mapToModel: true,
	replacements: {
	tconst: tconst
	}
});

This works, but is inefficient as we identified in the previous part. We can do the query with union to speed things up:

return sequelize.query(`
		SELECT DISTINCT NB.*
		FROM imdb.title_principals AS TP
		JOIN (
			SELECT tconst, directors, writers
			FROM imdb.title_crew
			WHERE tconst = :tconst
		) AS TC ON TC.tconst = TP.tconst
		LEFT JOIN imdb.name_basics AS NB ON NB.nconst = TP.nconst
	UNION
		SELECT DISTINCT NB.*
		FROM imdb.title_principals AS TP
		JOIN (
			SELECT tconst, directors, writers
			FROM imdb.title_crew
			WHERE tconst = :tconst
		) AS TC ON TC.tconst = TP.tconst
		LEFT JOIN imdb.name_basics AS NB ON TC.directors LIKE NB.nconst || ',%'::text
	UNION
		SELECT DISTINCT NB.*
		FROM imdb.title_principals AS TP
		JOIN (
			SELECT tconst, directors, writers
			FROM imdb.title_crew
			WHERE tconst = :tconst
		) AS TC ON TC.tconst = TP.tconst
		LEFT JOIN imdb.name_basics AS NB ON TC.directors LIKE '%,'::text || NB.nconst || ',%'::text
	UNION
		SELECT DISTINCT NB.*
		FROM imdb.title_principals AS TP
		JOIN (
			SELECT tconst, directors, writers
			FROM imdb.title_crew
			WHERE tconst = :tconst
		) AS TC ON TC.tconst = TP.tconst
		LEFT JOIN imdb.name_basics AS NB ON TC.directors LIKE '%,'::text || NB.nconst
	UNION
		SELECT DISTINCT NB.*
		FROM imdb.title_principals AS TP
		JOIN (
			SELECT tconst, directors, writers
			FROM imdb.title_crew
			WHERE tconst = :tconst
		) AS TC ON TC.tconst = TP.tconst
		LEFT JOIN imdb.name_basics AS NB ON TC.writers = NB.nconst
	UNION
		SELECT DISTINCT NB.*
		FROM imdb.title_principals AS TP
			JOIN (
			SELECT tconst, directors, writers
			FROM imdb.title_crew
			WHERE tconst = :tconst
		) AS TC ON TC.tconst = TP.tconst
		LEFT JOIN imdb.name_basics AS NB ON TC.writers LIKE NB.nconst || ',%'::text
	UNION
		SELECT DISTINCT NB.*
		FROM imdb.title_principals AS TP
		JOIN (
			SELECT tconst, directors, writers
			FROM imdb.title_crew
			WHERE tconst = :tconst
		) AS TC ON TC.tconst = TP.tconst
		LEFT JOIN imdb.name_basics AS NB ON TC.writers LIKE '%,'::text || NB.nconst || ',%'::text
	UNION
		SELECT DISTINCT NB.*
		FROM imdb.title_principals AS TP
		JOIN (
			SELECT tconst, directors, writers
			FROM imdb.title_crew
			WHERE tconst = :tconst
		) AS TC ON TC.tconst = TP.tconst
		LEFT JOIN imdb.name_basics AS NB ON TC.writers LIKE '%,'::text || NB.nconst
`, {
	model: nameBasic,
	mapToModel: true,
	replacements: {
	tconst: tconst
}});

They take 100 seconds and 25 seconds respectively. That’s not very fast. Let’s try implementing the same in the application code:

const crewViaTitlePrincipals = titlePrincipal
.findAll({
	attributes: ['nconst'],
	where: {
		'tconst': tconst
	}
}).then(crew => crew.map(c => c.nconst));

const crewViaTitleCrew = titleCrew
.findAll({
	where: {
		'tconst': tconst
	}
});

const crewMatchingNames = crewViaTitleCrew.then(crew => crew.flatMap(c => [
	c.directors.split(','),
	c.writers.split(',')
].flat()));

const allMatchingNames = crewViaTitlePrincipals.then(crew1 => crewMatchingNames.then(crew2 => new Set([crew1, crew2].flat())));

return allMatchingNames.then(names => nameBasic
	.findAll({
		where: {
			'nconst': { [Op.in]: [...names] }
		}
	}));

We first take the crew from the title_principals table. Next, we extract the crew from the title_crew. Finally, we split directors and writers, and we calculate the intersection of both datasets to get the names.

This generates the following trace in Metis:

‍

Database performance Insights on Metis - wrong use of SELECT command

‍

It’s all green! We can see that we have three queries sent, but they are blazingly fast because of indexes. This time it was much better to do processing in the application instead on the SQL level.

As mentioned in the previous section, there is no silver bullet. You need to measure your code and see where the performance benefits are.

Find the most prolific actor in a given period

Let’s now find the person that starred in the most movies in a given period. Let’s do that using application code:

const titlesMatchingPeriod = titleBasic
.findAll({
	attributes: ['tconst'],
	where: {
		startyear: {
			[Op.and]: [
				{ [Op.gte]: startYear },
				{ [Op.lte]: endYear }
			]
		}
	}
}).then(titles => titles.map(t => t.tconst));

const principals = titlesMatchingPeriod.then(titles => titlePrincipal
.findAll({
	attributes: ['nconst'],
	where: {
		tconst: { [Op.in]: [...new Set(titles)] }
	}
}).then(principals => {
	const counts = principals
	.reduce(
		(entryMap, e) => {
			entryMap[e.nconst] = (entryMap[e.nconst] || 0) + 1;
			return entryMap;
		},
		{}
	);
	const keys = Object.keys(counts);
	const countsWithKeys = keys.map(k => [counts[k], k]);
	countsWithKeys.sort((pair1, pair2) => pair2[0] - pair1[0]);
	const topResults = countsWithKeys.splice(0,1);
	return topResults;
}));

return principals.then(countsWithKeys => nameBasic
.findAll({
	where: {
		nconst: { [Op.in]: countsWithKeys.map(c => "" + c[1]) }
	}
}).then(actors => actors.map(a => {
	a.movies_count = countsWithKeys.filter(c => c[1] == a.nconst)[0][0];
	return a;
})));

We first find titles in a given period. Next, we take the crew for each of them. Finally, we need to group the identifiers and get the top one, and get the details. This is a bit lengthy and is rather slow because we get all the movies in a given period. That’s a lot of data:

Database performance Insights on Metis for list of actors

‍

Metis indicates nearly 70 000 rows. That’s a lot to process. What’s more, we then take all those identifiers and send in another query which is nearly 100 000 characters long.

Can we do better? Yes, we can send a raw query:

return sequelize.query(`
	WITH best_actor AS (
		SELECT TP.nconst, COUNT(*) AS number_of_titles
		FROM imdb.title_basics AS TB
		LEFT JOIN imdb.title_principals AS TP ON TP.tconst = TB.tconst
		WHERE TB.startyear >= :startyear AND TB.startyear <= :endyear AND TP.nconst IS NOT NULL
		GROUP BY TP.nconst
		ORDER BY number_of_titles DESC
		LIMIT 1
	)
	SELECT BA.nconst, BA.number_of_titles, NB.primaryname, nb.birthyear, NB.deathyear, nb.primaryprofession
	FROM best_actor AS BA
	JOIN imdb.name_basics AS NB ON NB.nconst = BA.nconst
`, {
	model: nameBasic,
	mapToModel: true,
	replacements: {
		startyear: startYear,
		endyear: endYear
	}
});

Metis shows the following:

‍

Database performance Insights on Metis using a raw query

There are two tables that we can configure indexes on: title_principals and title_basics. Once we do that, we get the best performance.

Find most prolific actors in a given genre

Let’s now find actors that did the most movies in a given genre. This is a very similar scenario to the previous one. Let’s start with the application code again:

const titlesMatchingGenre = titleBasic
.findAll({
	attributes: ['tconst', 'genres'],
	where: {
		genres: { [Op.like]: '%' + genre + '%' }
	}
}).then(titles => titles
	.filter(t => t.genres.split(',').indexOf(genre) >= 0)
	.map(t => t.tconst)
);

const principals = titlesMatchingGenre.then(titles => titlePrincipal
.findAll({
	attributes: ['nconst'],
	where: {
		tconst: { [Op.in]: [...new Set(titles)] }
	}
}).then(principals => {
	const counts = principals
	.reduce(
		(entryMap, e) => {
			entryMap[e.nconst] = (entryMap[e.nconst] || 0) + 1;
			return entryMap;
		},
		{}
	);
	const keys = Object.keys(counts);
	const countsWithKeys = keys.map(k => [counts[k], k]);
	countsWithKeys.sort((pair1, pair2) => pair2[0] - pair1[0]);
	const topResults = countsWithKeys.splice(0,10);
	return topResults;
}));

return principals.then(countsWithKeys => nameBasic
.findAll({
	where: {
		nconst: { [Op.in]: countsWithKeys.map(c => "" + c[1]) }
	}
}).then(actors => actors.map(a => {
	a.movies_count = countsWithKeys.filter(c => c[1] == a.nconst)[0][0];
	return a;
})));

This is nearly the same as before, only this time we query for genre. Metis shows the following:

‍

Database performance Insights on Metis querying for movie genre

This is even worse than the previous scenario. We now extract around 400 000 rows. While it takes 2.5 seconds to execute the query, sending the data over the wire takes much longer.

To fix that, we can use the raw query:

return sequelize.query(`
	WITH best_actors AS (
		SELECT TP.nconst, COUNT(*) AS movies_count
		FROM imdb.title_basics AS TB
		LEFT JOIN imdb.title_principals AS TP ON TP.tconst = TB.tconst
		WHERE TB.genres = :genre OR TB.genres LIKE (:genre || ',%') OR TB.genres LIKE ('%,' || :genre || ',%') OR TB.genres LIKE ('%,' || :genre)
		GROUP BY TP.nconst
		ORDER BY movies_count DESC
		LIMIT 10
	)
	SELECT BA.nconst, NB.primaryname, NB.birthyear, BA.movies_count
	FROM best_actors AS BA
	JOIN imdb.name_basics AS NB ON NB.nconst = BA.nconst
	ORDER BY movies_count DESC
`, {
	model: nameBasic,
	mapToModel: true,
	replacements: {
		genre: genre
	}
});

This gives the following:

‍

Finally, we can configure indexes and make this query much faster.

Finding most common coworkers

Let’s now try to find people that work together the most. We can do that with a raw query:

return sequelize.query(`
	WITH RECURSIVE numbers AS (
		SELECT 1 AS number
		UNION ALL
		SELECT number + 1 AS number FROM numbers WHERE number < 1500
	),
	titles_for_person AS (
			SELECT TC.tconst
			FROM imdb.title_crew AS TC
			WHERE directors = :nconst OR directors LIKE :nconst || ',%' OR directors LIKE '%,' || :nconst || ',%' OR directors LIKE '%,' || :nconst
		UNION
			SELECT TC.tconst
			FROM imdb.title_crew AS TC
			WHERE writers = :nconst OR writers LIKE :nconst || ',%' OR writers LIKE '%,' || :nconst || ',%' OR writers LIKE '%,' || :nconst
		UNION
			SELECT tconst
			FROM imdb.title_principals
			WHERE nconst = :nconst
	),
	titles_corresponding AS (
		SELECT TC.tconst, TC.directors, TC.writers
		FROM imdb.title_crew AS TC
		JOIN titles_for_person AS TFP ON TFP.tconst = TC.tconst
	),
	split_associations AS (
			SELECT TC.tconst, SPLIT_PART(TC.directors, ',', N.number) AS nconst
			FROM titles_corresponding AS TC
			CROSS JOIN numbers AS N
			WHERE directors IS NOT NULL AND CHAR_LENGTH(directors) - CHAR_LENGTH(REPLACE(directors, ',', '')) + 1 >= N.number
		UNION
			SELECT TC.tconst, SPLIT_PART(TC.writers, ',', N.number) AS nconst
			FROM titles_corresponding AS TC
			CROSS JOIN numbers AS N
			WHERE writers IS NOT NULL AND CHAR_LENGTH(writers) - CHAR_LENGTH(REPLACE(writers, ',', '')) + 1 >= N.number
	),
	all_associations AS (
			SELECT SA.tconst, SA.nconst
			FROM split_associations AS SA
		UNION
			SELECT TP.tconst, TP.nconst
			FROM imdb.title_principals AS TP
			JOIN titles_for_person AS TFP ON TFP.tconst = TP.tconst
	),
	other_people AS (
		SELECT nconst
		FROM all_associations
		WHERE nconst != :nconst
	),
	top_peers AS (
		SELECT OP.nconst, COUNT(*) as common_titles
		FROM other_people AS OP
		GROUP BY nconst
		ORDER BY common_titles DESC
		LIMIT 5
	)
	SELECT TP.nconst, TP.common_titles, NB.*
	FROM top_peers AS TP
	JOIN imdb.name_basics AS NB ON NB.nconst = TP.nconst
	ORDER BY TP.common_titles DESC
`, {
	model: nameBasic,
	mapToModel: true,
	replacements: {
		nconst: nconst
	}
});

Metis shows the following:

‍

Database performance Insights on Metis using a raw query to find coworkers

‍

We can see multiple table scans and millions of rows read. We can optimize that by adding indexes. Let’s see if we can make it faster just by running the code in application:

const titlesPrincipalMatchingPerson = titlePrincipal
.findAll({
	attributes: ['tconst'],
	where: {
		nconst: nconst
	}
}).then(titles => titles.map(t => t.tconst));

const otherTitlePrincipals = titlesPrincipalMatchingPerson.then(titles => titlePrincipal
.findAll({
	attributes: ['nconst'],
	where: {
		tconst: { [Op.in]: titles },
		nconst: { [Op.ne]: nconst }
	}
})).then(titles => titles.map(t => t.nconst));

const titleCrewMatchingPerson = titleCrew
.findAll({
	where: {
		[Op.or]: [
			{ directors: { [Op.like]: '%' + nconst + '%' } },
			{ writers: { [Op.like]: '%' + nconst + '%' } }
		]
	}
}).then(titles => {
	return titles
	.filter(t => (t.directors || "").split(",").indexOf(nconst) >= 0 || (t.writers || "").split(",").indexOf(nconst) >= 0)
	.map(t => [...new Set([(t.directors || "").split(","), (t.writers || "").split(",")].flat())].filter(n => n != nconst && n != "" && n))
});

const allTeammates = Promise.all([otherTitlePrincipals, titleCrewMatchingPerson]).then(nconsts => {
	return nconsts.flat().filter(n => n && n != "");
});

const topTeammates = allTeammates.then(nconsts => {
	const counts = nconsts
	.reduce(
		(entryMap, e) => {
			entryMap[e] = (entryMap[e] || 0) + 1;
			return entryMap;
		},
		{}
	);
	const keys = Object.keys(counts);
	const countsWithKeys = keys.map(k => [counts[k], k]);
	countsWithKeys.sort((pair1, pair2) => pair2[0] - pair1[0]);
	const topResults = countsWithKeys.splice(0,5);
	return topResults;
});

return topTeammates.then(countsWithKeys => nameBasic
.findAll({
	where: {
		nconst: { [Op.in]: countsWithKeys.map(c => "" + c[1]) }
	}
}).then(actors => actors.map(a => {
	a.common_titles = countsWithKeys.filter(c => c[1] == a.nconst)[0][0];
	return a;
})));

First, we want to take that data from the title_principals table. We start with a known nconst, find all movies for a given person, and then for all the movies we find other people.

Next, we take similar data from title_crew. We then parse the data, aggregate it, and find top teammates. Finally, we get the data from the name_basics table.

‍

Database performance Insights after parsing & aggregating teammates

‍

We see that Metis identifies indexes that may improve the scenario. Even without these indexes, the query running in application finishes in 6 seconds vs 19 seconds with raw query. That’s a good starting point. Once again, we need to add improvements and measure them to see how things work in practice.

Summary

We examined various real scenarios with Sequelize to see how we can improve the performance. In previous parts of this series we relied on the database only to make it faster. This time we can do some processing with ORM and the application to change the performance characteristics. Metis shows good insights into all what we do, so we can track the optimizations and performance benefits as we proceed with the implementation changes. We can do the same with our actual applications even if we have very few rows in the database.

How Metis Optimized Queries Executed by Sequelize - Part 2

Given two people, list what movies they appeared in together

List all of the cast and crew in a given movie

Find the most prolific actor in a given period

Find most prolific actors in a given genre

Finding most common coworkers

Summary

Keep Your Databases Fast

Harness Your Database Complexity

Prevent Your Data Workflows From Breaking

Never worry about your
database again!

Keep Your Databases Fast

Harness Your Database Complexity

Prevent Your Data Workflows From Breaking

Never worry about your database again!

Contact Us

Never worry about your
database again!