
Setting up PostgreSQL for Debezium

Edit: In the original post, I made a mistake about Debezium versions: it should be 1.2.2 to 1.3.0, not 2.0 to 3.0. We have also experienced a production incident with 1.2.1, so my recommendation is: go for 1.3.1 and no other version (as of November 2020).

Debezium is a popular Open Source change data capture (CDC) solution. It’s based on a set of well-established components (Kafka, Kafka Connect, ZooKeeper) and used successfully in production by companies like Delhivery or BlaBlaCar. However, the setup is not as straightforward as one may think, and you may encounter some obstacles. In this post, I will share some tips originating from hooking up Debezium to a fairly big and busy OLTP Postgres database.

Originally, Debezium connected to a Postgres DB using the wal2json or decoderbufs plugins. Don’t use them – since native logical replication became available in Postgres 10, a better option called “pgoutput” has been available in Debezium. From the database perspective, the Debezium connector is treated just like any other logical replication subscriber, and the usual rules of logical replication apply.

Check wal_level

Logical decoding – on which both logical replication and Debezium depend – requires the wal_level configuration parameter to be set to “logical”, and changing it requires a Postgres restart. Check this beforehand and schedule a database restart if necessary.
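
To check the current setting and, if needed, change it – a quick sketch (the new value takes effect only after a restart):

SHOW wal_level;
ALTER SYSTEM SET wal_level = 'logical';
-- ...then restart the PostgreSQL service.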

Pick the right Debezium version

NEVER, EVER use Debezium versions between 1.2.2 and 1.3.0. Those versions come with a bug which is hard to understand for anyone who isn’t a Postgres expert, but its consequences are simple and dangerous: using such a buggy Debezium version will take the master database down sooner or later. This is because Debezium will consume data changes, but won’t confirm the consumption to the database server. The server will retain all WAL segments since replication slot creation, and you will eventually run out of disk space (or money, if you use some pay-as-you-go storage like Amazon EFS or keep adding new disks).

Check database logs

When Debezium fails to connect, it returns a “Could not obtain encoding for database …” error message. Usually this has NOTHING to do with encoding – it simply means that it couldn’t connect. Check the Postgres log for details – it might be a wrong password or a missing pg_hba.conf rule. If the database log is clean, look for network issues.

Create slot and publication manually

While Debezium is capable of creating the required database entities (a logical replication slot and a publication) itself, it’s a bad idea. First – it requires superuser privileges for the Debezium user. Second – it will create a publication for all tables, which can have unpleasant side effects – for example, it will silently block all write queries to tables that don’t have a primary key. Third – you may run into a “publication does not exist” error even if the publication is indeed created correctly, see here.

My advice is: create the publication and the replication slot manually before setting up a connector. Beware though – creation order matters! You have to create the publication before you create the replication slot, otherwise you will see the “publication does not exist” error. The correct order is:

CREATE PUBLICATION debezium FOR TABLE <your table list here>;
SELECT pg_create_logical_replication_slot('debezium', 'pgoutput');

You are free to change the table list after that at any time using ALTER PUBLICATION.

On the Debezium side, set:

"publication.name":"debezium",
"slot.name":"debezium"

in your JSON config. Of course the “debezium” name isn’t mandatory, you can be more specific – this is just an example.

Also, especially for high-volume databases, it’s recommended to use the Avro format for serialization:

"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter":"io.confluent.connect.avro.AvroConverter"

Monitor closely

As with any replication slot, if the subscriber doesn’t consume changes, WAL segments will start to pile up. You have to monitor whether Debezium is up, whether the replication lag isn’t too big, and whether the database server has enough free disk space. Most of the time, if Debezium goes down, a simple restart will suffice, but you have to know when to do it. Debezium is much more crash-prone than Postgres, so a monitoring solution is a must-have.
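
For example, a query like the one below (run on the database server) shows whether the Debezium slot is still active and how much WAL the server retains for it:

SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;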

If you look into pg_stat_replication system view in Postgres, here are some differences between Debezium and regular logical replication:

  • The replay lag (or the difference between pg_current_wal_lsn and replay_lsn) can grow quite big and then drop suddenly, in one-minute intervals. This is because Debezium confirms change consumption every minute.
  • The reply_time in pg_stat_replication is in the past – in my case it was 1999-12-21 03:26:12 – which looks weird, but is normal.

Solving the “3rd arg isn’t within [0,1]” problem in pgRouting

Happy PostGIS Day everyone!

(for those who don’t know: the third Wednesday of November each year is celebrated as the “GIS Day“, and the-day-after-GIS Day, is, well, the PostGIS Day!)

One of the most annoying things about PostGIS is that spatial analysis functions sometimes fail due to a GEOS TopologyException. This happens because one of the participating geometries doesn’t conform to validity rules. One of the most common validity errors is a self-intersection – an “hourglass” polygon. It’s a well-known issue with well-known, but sometimes weird (zero-sized buffer, I’m looking at you!) solutions. But the documentation states that LINESTRINGs are always valid, so there won’t be any problems when working with linear geometries, right?

Wrong.

Imagine that you have some network (road, rail, electric, water…) and want to perform some network analysis on it. The way to go is to use the pgRouting extension, which is built on top of PostGIS. To convert a set of LINESTRINGs to a network topology suitable for analysis, you will want to use the pgr_nodeNetwork function first. And then you may encounter an error…

ERROR:  line_interpolate_point: 3rd arg isn't within [0,1]
 CONTEXT:  SQL statement "insert INTO public.my_network_noded (old_id,sub_id,geom) (  with cut_locations AS (
            SELECT l1id AS lid, locus
            FROM inter_loc
            -- then generates start AND end locus for each line that have to be cut buy a location point
            UNION ALL
            SELECT i.l1id  AS lid, 0 AS locus
            FROM inter_loc i left join public.my_network b on (i.l1id = b.id)
            UNION ALL
            SELECT i.l1id  AS lid, 1 AS locus
            FROM inter_loc i left join public.my_network b on (i.l1id = b.id)
            ORDER BY lid, locus
        ),
        -- we generate a row_number index column for each input line
        -- to be able to self-join the table to cut a line between two consecutive locations
        loc_with_idx AS (
            SELECT lid, locus, row_number() over (partition by lid ORDER BY locus) AS idx
            FROM cut_locations
        )
        -- finally, each original line is cut with consecutive locations using linear referencing functions
        SELECT l.id, loc1.idx AS sub_id, st_linesubstring(l.geom, loc1.locus, loc2.locus) AS geom
        FROM loc_with_idx loc1 join loc_with_idx loc2 using (lid) join public.my_network l on (l.id = loc1.lid)
        WHERE loc2.idx = loc1.idx+1
            -- keeps only linestring geometries
            AND geometryType(st_linesubstring(l.geom, loc1.locus, loc2.locus)) = 'LINESTRING') "
 PL/pgSQL function pgr_nodenetwork(text,double precision,text,text,text,text,boolean) line 229 at EXECUTE

But why? Well, linestrings are always valid, but not always simple. A common example of a non-simple linestring is a closed ring – a roundabout junction, for example. Another one is a self-intersecting line. Closed rings are OK for pgRouting, but self-intersecting lines are not:

(Image: an example of a self-intersecting line causing pgRouting to fail.)

So, if you ever encounter the “line_interpolate_point: 3rd arg isn’t within [0,1]” error when trying to node a network of linestrings in pgRouting, check your geometries using the ST_IsSimple function:

SELECT * FROM my_network WHERE ST_IsSimple(geom) = FALSE;

If the self-intersections are perfectly valid from a business standpoint, you will have to split those lines into smaller pieces for pgRouting purposes.
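
One possible approach is to node the offending lines so that the self-intersections become shared vertices of separate, simple pieces. A minimal sketch (assuming the “id” column is generated automatically and the remaining attributes can be recomputed):

BEGIN;
-- insert the simple pieces produced by noding each non-simple line
INSERT INTO my_network (geom)
SELECT (ST_Dump(ST_Node(geom))).geom
FROM my_network
WHERE NOT ST_IsSimple(geom);
-- remove the original, non-simple lines
DELETE FROM my_network WHERE NOT ST_IsSimple(geom);
COMMIT;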


Data consistency issues in logical replication

In contrast to physical (streaming) replication, write access on the subscriber node is not disabled. This means that DML and DDL statements like INSERT, UPDATE, DELETE or ALTER can be executed on the subscriber. As the replication is unidirectional, however, these changes won’t be replicated back to the publisher. This behavior can lead to problems – the replication can stop with all its consequences, or the data can become out of sync. In this post, I’d like to share some tips on how to deal with such problems.

Scenario 1: data INSERTed on subscriber side

Data can be inserted into a replication target table without any problems, provided that a row with an identical REPLICA IDENTITY won’t subsequently be inserted on the publisher node. For example, consider the following table:

CREATE TABLE mydata(id integer primary key, value integer, description varchar);

Its REPLICA IDENTITY will use the “id” primary key, which is the default.
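
You can check a table’s replica identity in the system catalog – ‘d’ means the default, i.e. the primary key:

SELECT relreplident FROM pg_class WHERE relname = 'mydata';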

You can safely execute the following INSERT on the subscriber:

INSERT INTO mydata VALUES (100, 1024, 'test');

…but if you execute an INSERT with identical primary key value on the publisher,

INSERT INTO mydata VALUES (100, 65536, 'test on publisher');

you will get an error on the subscriber node:

ERROR: duplicate key value violates unique constraint "mydata_pkey"
DETAIL: Key (id) = (100) already exists.

The replication will stop, and WAL segments will start to pile up on the publisher. Luckily, the subscriber log contains all the required information: you have to DELETE the problematic row on the subscriber:

DELETE FROM mydata WHERE id=100;

and the replication will restart automatically.

Scenario 2: data UPDATEd on subscriber side

As long as the required privileges are granted, UPDATE statements can be executed on the subscriber without any error or warning. The replication will not be halted. The publisher doesn’t know about the change, and as a result, the values in the same row can differ between the publisher and the subscriber. But if the same row gets updated on the publisher, the whole row will be overwritten on the subscriber, and the change will be gone – even if the changes were made to different columns.

Scenario 3: data DELETEd on subscriber side

This kind of inconsistency also won’t cause the replication to stop, and unless your log_min_messages configuration parameter is set to “debug1” or more verbose, chances are that you won’t notice it. If you do log debug messages and update or delete a row on the publisher which doesn’t exist on the subscriber, you should expect the following log entries on the subscriber node:

DEBUG: logical replication did not find row for update in replication target relation "mydata"
DEBUG: logical replication did not find row for delete in replication target relation "mydata"

Sometimes keeping a smaller dataset on the subscriber is desired, and such messages can be safely ignored. But what if it’s an inconsistency that has to be resolved? Unfortunately, native logical replication – unlike pglogical – doesn’t provide tools to resynchronize a table. But there is a workaround…

How to resynchronize a table in PostgreSQL logical replication?

Using officially documented tools only, you can either:

  1. Lock the table for writes on the publisher, dump the data, copy the dump to the subscriber, truncate the table on the subscriber, restore the dump there and remove the lock on the publisher.
  2. Exclude the table from the current publication, truncate the table on the subscriber, and create a new publication-subscription pair.

But if you don’t mind editing the system catalogs by hand, there is another possibility…

WARNING: black magic ahead. Proceed at your own risk. I’m not responsible for any data loss or corruption resulting from following this advice. Always have a backup in case things go wrong. You have been warned.

The state of logically replicated tables is kept in the pg_subscription_rel catalog. The “srsubstate” column indicates whether the table is being initialized, its data is being copied, it is synchronized, or the replication is ready. When you change its value to ‘i’, meaning “initialize”, you force the initial sync to start again. So, in order to resync a table called “mydata”, execute the following statements on the subscriber:

TRUNCATE TABLE mydata;
UPDATE pg_subscription_rel SET srsubstate = 'i' WHERE srrelid = (SELECT relid FROM pg_stat_user_tables WHERE schemaname='public' AND relname='mydata');

(if the table is in another schema, replace “public” with relevant schema name)

After a while, check the pg_subscription_rel table – the state should automatically change to ‘d’, then to ‘s’ and finally to ‘r’, meaning that the resync is complete. If the state hangs at ‘d’, check the logs and pg_stat_replication on the publisher node – if the replication fails due to a duplicate primary key value, purge the data on the subscriber, check that the row count is actually zero, and repeat the pg_subscription_rel update.
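
To watch the progress, you can query the same catalog, for example:

SELECT srrelid::regclass AS table_name, srsubstate FROM pg_subscription_rel;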


Scaling row level security to group roles

Row level security is a great Postgres feature that allows you to grant access to selected rows only, without having to create additional data structures. The common setup is to add a column with user names, and a policy that compares this column’s value with CURRENT_USER:

CREATE TABLE issues(id integer generated always as identity primary key, title varchar, description varchar, status varchar, assignee varchar);

ALTER TABLE issues ENABLE ROW LEVEL SECURITY;

CREATE POLICY issue_rls ON issues FOR ALL TO public USING (assignee = CURRENT_USER);

Then, any user who isn’t a superuser and doesn’t have the BYPASSRLS attribute will see (and be able to edit) only the rows with the relevant “assignee” value. This works well in an environment with only a handful of users. But as the number of users grows, it can become difficult to maintain: sometimes we’ll want to assign some rows to multiple users, or to a whole group of users. What to do then?

Enter the pg_has_role function. It checks whether a particular user has a given role. Its usage is quite simple:

SELECT pg_has_role('chinook', 'service', 'USAGE');

The first argument is the user name, the second is a role name, and the third is a bit tricky: ‘MEMBER’ checks whether the user is an actual member of the role or can execute SET ROLE to become one, while ‘USAGE’ checks for actual membership only. As users are in fact a specific case of roles with the LOGIN attribute (for very old Postgres versions – 8.0 and earlier – this isn’t true, but those versions don’t have row level security anyway), this function works equally well for individual users and group roles.

So, we’ll modify the policy as such:

DROP POLICY issue_rls ON issues;
CREATE POLICY issue_rls ON issues FOR ALL TO public USING (pg_has_role(current_user, assignee, 'USAGE'));

…and now the rows can be assigned to individual users or whole roles.
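
For example (the role and user names below are made up), a row can now be assigned to a whole team:

CREATE ROLE support_team NOLOGIN;
GRANT support_team TO alice, bob;
UPDATE issues SET assignee = 'support_team' WHERE id = 42;
-- both alice and bob will now see and be able to edit issue 42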


Pitfalls and quirks of logical replication in Postgres 12

Logical replication, in a nutshell, is the process of “replaying” data changes in another database. The first attempts to implement replication in Postgres – Slony, Bucardo – were logical replication solutions. In contrast to binary (streaming) replication, it offers greater flexibility: data can be replicated selectively (only the relevant databases and tables); the replica server remains a fully functional instance that can have its own users, security rules and non-replicated databases; and in some cases the performance is better.

However, logical replication is not as foolproof as binary replication, and without proper care it can lead to a primary server crash. I’d like to share some thoughts after setting up logical replication in a large-ish (one terabyte) production database.

Do not proceed without monitoring

Seriously, if you don’t have a robust monitoring solution that will warn you about system abnormalities – especially running out of disk space – set it up and test it before any attempt to implement logical replication. This is because logical replication can break silently without any SQL exceptions, cause the WAL files to pile up, fill the entire disk and bring the primary server down. At least two things have to be monitored:

  • disk usage,
  • errors in Postgres log.

Only with such an early warning system will you be able to fix issues before they cause an outage.
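
For a quick look at how much space the WAL files currently occupy, you can run something like this (as a superuser or a member of pg_monitor):

SELECT pg_size_pretty(sum(size)) AS wal_size FROM pg_ls_waldir();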

Replicate to the same server? Possible, but…

Logical replication can be used to replicate data between databases in a single Postgres cluster. It’s a perfectly valid setup, but it requires special treatment: you have to create the logical replication slot first, and with the slot already in place, create a subscription pointing to that slot. If you try to set up the replication in the default way – with automatic slot creation – the CREATE SUBSCRIPTION command will hang. No errors or warnings – just a hung statement.
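
A sketch of that sequence (the names and the connection string are just examples):

-- in the source database:
SELECT pg_create_logical_replication_slot('my_sub', 'pgoutput');

-- in the target database, point the subscription at the existing slot:
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=localhost dbname=source_db'
    PUBLICATION my_pub
    WITH (create_slot = false, slot_name = 'my_sub');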

Be careful with PUBLICATION FOR ALL TABLES

Like all previous logical replication solutions, native replication doesn’t replicate data structure changes. If any DDL command is issued (CREATE, ALTER, DROP), it must be executed manually both on the primary and the replica(s). The FOR ALL TABLES modifier of a CREATE PUBLICATION statement doesn’t change this limitation. Instead, it will cause the replication to stop as soon as you (or your application, or a Postgres extension) issue a CREATE TABLE statement. Also, FOR ALL TABLES will automatically include any tables created by extensions (like spatial_ref_sys from PostGIS) and tables that don’t have a primary key or replica identity – both cases are problematic.

Spatial is special

The spatial_ref_sys table is a part of the PostGIS extension, and it’s populated by the CREATE EXTENSION postgis; statement. More often than not, it shouldn’t be replicated, as every PostGIS instance populates it by itself. If you have to replicate it (for example, because you work with coordinate systems that aren’t part of the EPSG registry), remember to TRUNCATE the spatial_ref_sys table on the replica before creating the subscription.

Review the primary keys carefully

A table eligible for logical replication must have a primary key constraint or a replica identity – that’s the rule. It’s nothing new, as the same was true for previous logical replication solutions, but its enforcement in native Postgres replication is at least weird. You are allowed to add a table without a PK or replica identity to a publication, and it won’t cause any error, but… it will block any write activity to it!

You will need to add a PK as soon as possible (or, if you can’t afford an exclusive lock for the duration of unique index creation, REPLICA IDENTITY FULL will do, although it’s less performant) to unlock write access.
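
For example, assuming a table called “events” with a unique, not-null “id” column:

-- either add a primary key...
ALTER TABLE events ADD PRIMARY KEY (id);
-- ...or, if the exclusive lock is not acceptable right now:
ALTER TABLE events REPLICA IDENTITY FULL;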

After adding a new table, refresh publication

Executing “ALTER PUBLICATION … ADD TABLE …” on the primary server is not enough to add a new table to the replication. You also have to log into the replica database and execute the following statement:

ALTER SUBSCRIPTION … REFRESH PUBLICATION;

Summary

Logical replication is great and has many use cases: separating transactional and analytical workloads, aggregating changes from multiple databases, and so on. But it is not as simple as it looks at first glance. Follow the rules and use it with caution, and you should enjoy fast, flexible data replication.


Batch operations made easy with psql’s \gexec

Any DBA will sometimes have to execute some kind of batch DDL / DCL operation, for example:

  • grant privileges for all tables in all schemas to a specific user,
  • create a spatial index for all GEOMETRY columns that don’t have one,
  • add a list of tables to logical replication,
  • alter all INTEGER columns to BIGINT,
  • … and many more.

If the database structure is simple, with only a handful of tables, this can be done manually without much effort. But what if there are dozens of schemas, hundreds of tables and thousands of columns? Possible automation solutions include writing a script in a scripting language of choice (Python, Perl…) or a PL/pgSQL procedure. But there’s one more possibility: use one of the lesser-known superpowers of the psql command-line client – the \gexec command.

How does it work? psql takes the query output and executes each resulting value as an SQL statement. For example, to grant SELECT privileges for all tables in all schemas to the “chinook” user, you will want to execute the following command:

SELECT 'GRANT SELECT ON ALL TABLES IN SCHEMA ' || quote_ident(schema_name) || ' TO chinook' FROM information_schema.schemata \gexec

psql will output the result – a set of “GRANT SELECT ON ALL TABLES IN SCHEMA …” statements – and execute them one by one. The use of the quote_ident function around the object name isn’t always strictly necessary, but it is helpful – if any name requires double quotes, they will be added automatically.
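
Another task from the list above can be handled the same way – for example (the publication and schema names are made up), adding a set of tables to logical replication:

SELECT format('ALTER PUBLICATION my_pub ADD TABLE %I.%I', schemaname, tablename)
FROM pg_tables
WHERE schemaname = 'app' \gexec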

Personally, I like this method very much because it’s a huge time saver, and doesn’t require any boilerplate code.


Checking pg_hba.conf content using SQL

Creating new users and granting database access is a common DBA task. But unless you have a very permissive network access policy like “all all 0.0.0.0/0 md5”, you may sometimes be unsure whether the current HBA rules are sufficient for a newly created user to get access (and not come back to you with an angry “it doesn’t work!”). You have to leave psql / pgAdmin / another tool of choice and check the pg_hba.conf file content, right? Well, with PostgreSQL 10 or newer, you don’t have to. Just run:

SELECT * FROM pg_hba_file_rules;

and you’ll see all the rules in a tabular format.

 line_number | type  |   database    | user_name  |  address  |                 netmask                 | auth_method | options | error
-------------+-------+---------------+------------+-----------+-----------------------------------------+-------------+---------+-------
          84 | local | {all}         | {all}      |           |                                         | trust       |         |
          86 | host  | {all}         | {all}      | 127.0.0.1 | 255.255.255.255                         | md5         |         |
          88 | host  | {all}         | {all}      | ::1       | ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff | md5         |         |
          91 | local | {replication} | {all}      |           |                                         | trust       |         |
          92 | host  | {replication} | {all}      | 127.0.0.1 | 255.255.255.255                         | trust       |         |
          93 | host  | {replication} | {all}      | ::1       | ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff | trust       |         |
          94 | host  | {savannah}    | {elephant} | 0.0.0.0   | 0.0.0.0                                 | md5         |         |
(7 rows)


QGIS 3.14 Pi supports GENERATED columns

QGIS, a desktop GIS package, has been a PostgreSQL/PostGIS client since its very beginning. The latest release, 3.14 codename “Pi”, along with shiny new features, also adds bug fixes that make using Postgres’ GENERATED columns possible. This might not be a game-changer, but it definitely makes a PostGIS DBA’s life easier. Why? Read on.

Generated columns are similar in functionality to virtual fields – they are populated with computed values – but implemented at the database level. That means every application using the DB, be it another QGIS instance, GeoServer, web apps, even ArcGIS, can use those columns, while virtual fields are a QGIS-only feature, bound to a specific project. There are two kinds of generated columns: IDENTITY and user-defined.

IDENTITY is the new SERIAL

Every decent database has some method of generating row identifiers: MySQL has AUTO_INCREMENT, Oracle has ROWID, MongoDB has _id, and so on. In Postgres, the preferred way to deal with this used to be the SERIAL pseudo-type:

CREATE TABLE fields(id SERIAL PRIMARY KEY, name VARCHAR, geom GEOMETRY);

This code generates an INTEGER column, a sequence named “fields_id_seq”, a DEFAULT value taken from that sequence, and a PRIMARY KEY constraint (in fact, a SERIAL column doesn’t have to be a primary key, but that’s its main use case). The type name “serial” is then gone, and the column is treated as a regular INTEGER column.

It worked, but had its downsides: for example, the permissions for the sequence have to be granted separately – with GRANT ALL or GRANT INSERT ON TABLE, but without GRANT USAGE ON SEQUENCE, the user won’t be able to add new features.
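
For example (the user name is made up), both of these grants are needed before such a user can insert new features:

GRANT INSERT ON fields TO gis_editor;
GRANT USAGE ON SEQUENCE fields_id_seq TO gis_editor;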

The new approach is defined by the SQL standard and implemented in Postgres 10 and newer; the syntax is:

CREATE TABLE fields(id INTEGER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, name VARCHAR, geom GEOMETRY);

Indeed, there are more characters to type. But no more “GRANT USAGE ON SEQUENCE…”. Sadly, the GUI tools in QGIS haven’t caught up yet, and creating a table using a form, or importing an existing layer into PostGIS, will still yield the old-fashioned SERIAL column.

User-defined generated columns

Especially in a spatial database, sometimes there is a requirement for a field holding a computed value derived from other properties: X/Y coordinates, area, perimeter, or length, to name a few. Before Postgres 12, this could be implemented using triggers:

ALTER TABLE fields ADD COLUMN area NUMERIC;

CREATE FUNCTION calc_area() RETURNS TRIGGER AS $$
BEGIN
    NEW.area = ST_Area(NEW.geom);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER calc_area_trigger BEFORE INSERT OR UPDATE ON fields FOR EACH ROW EXECUTE PROCEDURE calc_area();

In Postgres 12 and newer, there is another possibility: generate the value using the GENERATED syntax:

ALTER TABLE fields ADD COLUMN area NUMERIC GENERATED ALWAYS AS (ST_Area(geom)) STORED;

This means the same “bang” (functionality) for less “buck” (code). No need to create a function and a trigger – just an expression and you are done. While QGIS currently doesn’t automatically block input fields in the attribute form for such columns, any user-supplied value will be overwritten on session end (same as with triggers).

Of course, triggers aren’t going anywhere: for more complicated cases they will be still necessary. But for simple area/length/coordinate calculation, GENERATED columns offer a simpler, declarative approach.

So if you have a PostGIS instance based on Postgres 12, use QGIS, and haven’t upgraded your QGIS yet – this is one more reason to do so!


PostGIS vs earthdistance for spatial queries

So ya thought ya might not need PostGIS for location-based queries in PostgreSQL…

As a PostGIS aficionado, I was triggered by this post. But after a while, I decided to give earthdistance a try – it’s just a piece of technology, after all, and sticking fanatically to one tool or another isn’t the best way to go.

Round one: setup

For test purposes, I imported a full Geonames dump (the detailed setup is described in a previous post).

The bootstrapping required for earthdistance consists of creating an extension:

CREATE EXTENSION earthdistance CASCADE;

(the CASCADE modifier is needed to install the “cube” extension, which is a required dependency)

The earthdistance extension can work out of the box with regular numeric “latitude” and “longitude” columns, but that approach has two problems: the distance units are hardcoded to statute miles, and there is no index support. Using miles in continental Europe and a sequential scan on a table with 12 million rows are both bad ideas. Fortunately, there is an option that supports both an index and meters: the cube-based earthdistance functions, which use a special “earth” type. So I created an index using the GiST method and the ll_to_earth function:

CREATE INDEX ON geoname USING GIST(ll_to_earth(latitude, longitude));

Creating this index took 11 minutes and 57 seconds on a laptop.

For PostGIS, the CREATE EXTENSION step is also required:

CREATE EXTENSION postgis;

(no CASCADE needed here). After that, we can create a geography column (yes, geography, not geometry – our test dataset has global coverage and lon-lat coordinates, a perfect use case for geography and a bad one for geometry):

ALTER TABLE geoname ADD COLUMN geom GEOGRAPHY;
UPDATE geoname SET geom = ST_MakePoint(longitude, latitude)::geography;

-- for Postgres 12 and newer, if you don't plan to edit location data in specialized GIS software, you can also use a generated column instead:

ALTER TABLE geoname ADD COLUMN geom GEOGRAPHY GENERATED ALWAYS AS (ST_MakePoint(longitude, latitude)::geography) STORED;

This took 7 minutes 47 seconds. If you are into GIS, you may notice that I skipped all the coordinate-system-related black magic. That’s because PostGIS assumes the coordinates in the geography type to be WGS-84, aka “GPS coordinates”, and specifying a coordinate system is not required. The second step is to create an index:

CREATE INDEX ON geoname USING GIST(geom);

which took 2 minutes and 20 seconds.

Despite the fact that the PostGIS setup is a bit more complicated (a new column and an index on it instead of a functional index on existing columns), PostGIS is actually a bit faster here.

Round two: querying

After the necessary bootstrapping is done, it’s time to do some spatial queries like “find the 10 nearest pubs within a 10-kilometer radius”. As the GeoNames dataset doesn’t include pubs (you’ll need OpenStreetMap data for that), I queried for the nearest mountain peaks located within 10 kilometers of the Zakopane railway station.

For PostGIS, such a query is done using the ST_DWithin (for filtering) and ST_Distance (for precise distance calculation and sorting) functions:

SELECT name, elevation, ST_Distance(geom, ST_MakePoint(19.95,49.3)) AS dist FROM geoname WHERE fclass='T' AND elevation !='' AND ST_DWithin(ST_MakePoint(19.95,49.3), geom, 10000) ORDER BY dist ASC;

The query was executed in 3 ms.

Next, the earthdistance way:

SELECT name, elevation, earth_distance(ll_to_earth(latitude, longitude), ll_to_earth(49.3, 19.95)) AS dist FROM geoname WHERE fclass='T' AND elevation !='' AND earth_box(ll_to_earth(49.3, 19.95), 10000) @> ll_to_earth(latitude, longitude) AND earth_distance(ll_to_earth(latitude, longitude), ll_to_earth(49.3, 19.95)) < 10000 ORDER BY dist ASC;

This one was executed in 4.5 ms.

Note that the earth_box function alone is not sufficient. The WHERE clause must be refined with an additional earth_distance-based filter, and the docs explicitly say so:

“Some points in this box are further than the specified great circle distance from the location, so a second check using earth_distance should be included in the query.”

Round three: precision

The earthdistance extension assumes a spherical Earth, while PostGIS geography uses the WGS84 ellipsoid as the Earth model (the same as used by the GPS system). The distance calculations in PostGIS are thus more precise. But how big is the difference?

SELECT name, elevation, earth_distance(ll_to_earth(latitude, longitude), ll_to_earth(49.3, 19.95)),
ST_Distance(geom, ST_MakePoint(19.95, 49.3))
FROM geoname
WHERE fclass='T' AND elevation != '' AND
ST_DWithin(geom, ST_MakePoint(19.95, 49.3), 10000)
ORDER BY st_distance ASC LIMIT 3;

     name      | elevation | earth_distance | st_distance
---------------+-----------+----------------+-------------
 Gubałówka     |      1126 |       1813.123 |     1816.16
 Samkowa Czuba |      1189 |        3528.23 |     3527.26
 Krokiew       |      1378 |        3770.05 |     3767.39

At 50 degrees of latitude, the earthdistance results are about 1-3 meters off.

Conclusion

For very basic spatial queries, both Postgres extensions offer very similar performance, with PostGIS being slightly faster in some cases and offering better precision. The inaccuracy of earthdistance lies within the error margin of an average GPS receiver (but the hardware is continuously getting better, and the sphere-based calculations aren’t!). My recommendation is: if you can use the PostGIS extension, do so. The basic setup is not that complicated (you don’t have to know anything about SRIDs and projections when using the geography type!), and you get better precision and slightly better performance for free. Also, your design will be future-proof: if more sophisticated spatial features like geofencing, measuring distance along a line, or matching a GPS track to a street network become necessary, PostGIS will be there to help. For constrained environments, where installing another software package is either impossible or requires a mountain of paperwork (like in a bank), earthdistance remains a viable option.


GiST vs GIN index for LIKE searches – a comparison

The LIKE operator is a very important tool for text searches in any SQL database. Unfortunately, with the default configuration it triggers a sequential scan of the whole table, and for bigger datasets its performance is quite bad.

For some use cases, when the wildcard (percent sign) is always at the end of the search term, a simple operator class – varchar_pattern_ops – used in a standard B-tree index will be sufficient to speed up LIKE queries. But when the requirement is more complicated, with B-tree you are out of luck.
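
For example, a quick sketch with a hypothetical “places” table:

CREATE INDEX ON places (name varchar_pattern_ops);
SELECT name FROM places WHERE name LIKE 'Kato%'; -- can use the B-tree index
SELECT name FROM places WHERE name LIKE '%wice'; -- cannot, falls back to a sequential scan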

This is where the pg_trgm extension kicks in. With it enabled, you are able to create a GiST or GIN index with the gist_trgm_ops or gin_trgm_ops operator class, respectively, and query textual columns more efficiently. But which index type is better? Let’s try it out.

I downloaded and imported the GeoNames full dump. It consists of 12 million rows, so it’s big enough to be a non-trivial dataset. Then I made two separate tables, geoname_gist and geoname_gin, and filled them with data (the code below is for psql, using the \copy command):

CREATE TABLE geoname_gist(geonameid integer primary key, name varchar, asciiname varchar, alternatenames varchar, latitude float, longitude float, featureclass varchar, featurecode varchar, countrycode2 varchar, cc3 varchar, admin1code varchar, admin2code varchar, admin3code varchar, admin4code varchar, population integer, elevation varchar, dem varchar, timezone varchar, modified_at date);
CREATE TABLE geoname_gin (LIKE geoname_gist);
\copy geoname_gist from '/home/me/Downloads/allCountries.txt'
\copy geoname_gin from '/home/me/Downloads/allCountries.txt'

After that, I created indexes:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON geoname_gist USING gist(name gist_trgm_ops);
CREATE INDEX ON geoname_gin USING gin(name gin_trgm_ops);

The GiST index took 12 minutes 10 seconds to build on an i7 laptop, while the GIN one was ready in 2 minutes 50 seconds. And what about the disk usage? Let the results speak for themselves:

SELECT pg_size_pretty(pg_relation_size('geoname_gist_name_idx'));

pg_size_pretty
---------------
1940 MB
(1 row)

SELECT pg_size_pretty(pg_relation_size('geoname_gin_name_idx'));

pg_size_pretty
---------------
323 MB
(1 row)

And then, I checked the performance of both indexes using some common name suffixes in Central Europe:

EXPLAIN ANALYZE SELECT name FROM geoname_gist WHERE name LIKE '%wice';
EXPLAIN ANALYZE SELECT name FROM geoname_gist WHERE name LIKE '%ovo';
EXPLAIN ANALYZE SELECT name FROM geoname_gist WHERE name LIKE '%burg';

EXPLAIN ANALYZE SELECT name FROM geoname_gin WHERE name LIKE '%wice';
EXPLAIN ANALYZE SELECT name FROM geoname_gin WHERE name LIKE '%ovo';
EXPLAIN ANALYZE SELECT name FROM geoname_gin WHERE name LIKE '%burg';

Same queries, same dataset, very similar execution plans, but wildly different results: with the GiST index, execution times were 1-3 seconds, while with the GIN index it took only 50-100 milliseconds to complete.

Takeaway: always use GIN for trigram indexing, and if your database suffers from poor LIKE performance, check not only the scan type but also the indexing method.