
GiST vs GIN index for LIKE searches – a comparison

The LIKE operator is an essential tool for text searches in any SQL database. Unfortunately, without a suitable index it forces a sequential scan of the whole table, and on larger datasets its performance is quite bad.

For some use cases, when the only wildcard (percent sign) is at the end of the search term, a standard B-tree index built with the varchar_pattern_ops operator class is enough to speed up LIKE queries. But when the pattern is more complicated, for example with a leading wildcard, a B-tree index will not help.
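A minimal sketch of the prefix-only case, assuming a hypothetical table called geoname with a varchar column called name (neither is created in this post):

```sql
-- Hypothetical table "geoname" with a varchar column "name".
CREATE INDEX geoname_name_prefix_idx ON geoname (name varchar_pattern_ops);

-- Pattern anchored at the start: the planner can use the B-tree index
-- to scan only the range of values beginning with 'Kato'.
EXPLAIN SELECT name FROM geoname WHERE name LIKE 'Kato%';

-- Leading wildcard: the B-tree index cannot narrow the search here,
-- so expect a scan over the whole table.
EXPLAIN SELECT name FROM geoname WHERE name LIKE '%wice';
```

Comparing the two plans should make the limitation visible: only the first query can benefit from the index.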

This is where the pg_trgm extension comes in. With it enabled, you can create a GiST or GIN index using the gist_trgm_ops or gin_trgm_ops operator class, respectively, and query text columns more efficiently. But which index type is better? Let’s try it out.
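To get a feel for what a trigram index stores, pg_trgm provides the show_trgm function, which lists the three-character sequences extracted from a string. The sample string below is an arbitrary choice for illustration:

```sql
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- List the trigrams pg_trgm extracts from a sample string.
-- The input is lowercased, and each word is padded with two spaces
-- in front and one space behind before it is split into trigrams.
SELECT show_trgm('Magdeburg');
```

A trigram index can answer a LIKE query by looking up the trigrams found in the pattern and then rechecking the candidate rows against the full pattern, which is why it still works when the wildcard comes first.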

I have downloaded and imported the GeoNames full dump. It consists of 12 million rows, so it’s big enough to be a non-trivial dataset. Then I made two separate tables, geoname_gist and geoname_gin, and filled them with data (the code below is for psql, using a \copy command):

CREATE TABLE geoname_gist(geonameid integer primary key, name varchar, asciiname varchar, alternatenames varchar, latitude float, longitude float, featureclass varchar, featurecode varchar, countrycode2 varchar, cc3 varchar, admin1code varchar, admin2code varchar, admin3code varchar, admin4code varchar, population integer, elevation varchar, dem varchar, timezone varchar, modified_at date);
CREATE TABLE geoname_gin (LIKE geoname_gist);
\copy geoname_gist from '/home/me/Downloads/allCountries.txt'
\copy geoname_gin from '/home/me/Downloads/allCountries.txt'

After that, I created indexes:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON geoname_gist USING gist(name gist_trgm_ops);
CREATE INDEX ON geoname_gin USING gin(name gin_trgm_ops);

The GiST index took 12 minutes 10 seconds to build on an i7 laptop, while the GIN index was ready in 2 minutes 50 seconds. And what about the disk usage? Let the results speak for themselves:

SELECT pg_size_pretty(pg_relation_size('geoname_gist_name_idx'));

pg_size_pretty
---------------
1940 MB
(1 row)

SELECT pg_size_pretty(pg_relation_size('geoname_gin_name_idx'));

pg_size_pretty
---------------
323 MB
(1 row)

And then, I checked the performance of both indexes using some common name suffixes in Central Europe:

EXPLAIN ANALYZE SELECT name FROM geoname_gist WHERE name LIKE '%wice';
EXPLAIN ANALYZE SELECT name FROM geoname_gist WHERE name LIKE '%ovo';
EXPLAIN ANALYZE SELECT name FROM geoname_gist WHERE name LIKE '%burg';

EXPLAIN ANALYZE SELECT name FROM geoname_gin WHERE name LIKE '%wice';
EXPLAIN ANALYZE SELECT name FROM geoname_gin WHERE name LIKE '%ovo';
EXPLAIN ANALYZE SELECT name FROM geoname_gin WHERE name LIKE '%burg';

Same queries, same dataset, very similar execution plans, but wildly different results: with the GiST index the queries took 1-3 seconds to execute, while with the GIN index they completed in only 50-100 milliseconds.
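When two plans look alike but run at very different speeds, the BUFFERS option of EXPLAIN can help show where the time goes, because it reports how many pages each plan node had to touch. A sketch using the same tables and one of the patterns from above (the numbers you get will depend on your data and cache state):

```sql
-- Same query against both tables, with buffer usage included in the plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT name FROM geoname_gist WHERE name LIKE '%burg';

EXPLAIN (ANALYZE, BUFFERS)
SELECT name FROM geoname_gin WHERE name LIKE '%burg';
```

Running each statement more than once is a reasonable precaution, so that a cold cache on the first run does not skew the comparison.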

Takeaway: for LIKE searches on a dataset like this one, use GIN for trigram indexing, and if your database suffers from poor LIKE performance, check not only the scan type, but also the indexing method.
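One way to check the indexing method of existing indexes is the pg_indexes system view, whose indexdef column contains the full CREATE INDEX statement, including the USING clause. The table names below are the ones used in this post; substitute your own:

```sql
-- Show every index on the two test tables together with its definition.
-- The access method (btree, gist, gin) appears after USING in indexdef.
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('geoname_gist', 'geoname_gin');
```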