Elephant Tamer – Page 2 – On Postgres, PostGIS, geospatial data and databases.

Checking pg_hba.conf content using SQL

Post author By elephanttamer
Post date July 9, 2020

Creating new users and granting database access is a common DBA’s task. But unless you have a very permissive network access policy like “all all 0.0.0.0/0 md5”, you may sometimes be unsure if current HBA rules are sufficient for newly created user to get access (and not return to you with angry “it doesn’t work!” instead). You have to leave psql / pgAdmin / another-tool of choice, and check the pg_hba.conf file content, right? Well, with PostgreSQL 10 or newer, you don’t have to. Just run:

SELECT * FROM pg_hba_file_rules;

and you’ll see all the rules in a tabular format.

line_number | type  |   database    | user_name  |  address  |                 netmask                 | auth_method | options | error 

-------------+-------+---------------+------------+-----------+-----------------------------------------+-------------+---------+-------

          84 | local | {all}         | {all}      |           |                                         | trust       |         | 

          86 | host  | {all}         | {all}      | 127.0.0.1 | 255.255.255.255                         | md5         |         | 

          88 | host  | {all}         | {all}      | ::1       | ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff | md5         |         | 

          91 | local | {replication} | {all}      |           |                                         | trust       |         | 

          92 | host  | {replication} | {all}      | 127.0.0.1 | 255.255.255.255                         | trust       |         | 

          93 | host  | {replication} | {all}      | ::1       | ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff | trust       |         | 

          94 | host  | {savannah}    | {elephant} | 0.0.0.0   | 0.0.0.0                                 | md5         |         | 

(7 rows)

Uncategorized

QGIS 3.14 Pi supports GENERATED columns

Post author By elephanttamer
Post date June 25, 2020

QGIS, a Desktop GIS package, was a PostgresSQL/PostGIS client since its very beginning. The latest release, 3.14 codename “Pi”, along shiny new features, adds also bug fixes making use of Postgres’ GENERATED columns possible. This might not be a game-changer, but it definitely makes a PostGIS DBA’s life easier. Why? Read on.

Generated columns are similar in functionality to virtual fields – they are populated with computed values – but implemented at the database level. That means every application using the DB, be it another QGIS instance, Geoserver, web apps, even ArcGIS can use those columns, while virtual fields are a QGIS-only feature, and bound to a specific project. Threre are two kinds of generated columns, IDENTITY and user-defined.

IDENTITY is the new SERIAL

Every decent database has some method of generating row identifiers: MySQL has AUTO INCREMENT, Oracle has ROWID, MongoDB has _id and so on. In Postgres, the preferred way to deal with this issue was to use the SERIAL pseudo-type:

CREATE TABLE fields(id SERIAL PRIMARY KEY, name VARCHAR, geom GEOMETRY);

This code generated a INTEGER column, a sequence named “fields_id_seq”, a DEFAULT value being the next value of sequence, and a PRIMARY KEY constraint (in fact, SERIAL doesn’t have to be a primary key, but that’s its main use case). The type name “serial” was gone and the column was treated as a regular INTEGER column.

It worked, but had its downsides: for example, the permissions for sequence have to be granted separately – with GRANT ALL or GRANT INSERT ON TABLE, but without GRANT USAGE ON SEQUENCE, the user wouldn’t be able to add new features.

The new approach is defined by SQL standard, implemented in Postgres 10 and newer, for which the syntax is:

CREATE TABLE fields(id INTEGER GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, name VARCHAR, geom GEOMETRY);

Indeed, more characters to type. But no “GRANT USAGE ON SEQUENCE…” anymore. Sadly, the GUI tools in QGIS haven’t caught up yet and creating the table using a form, or importing an existing layer into PostGIS will yield the old-fashioned SERIAL column.

User-defined generated columns

Especially in a spatial database, sometimes there is a requirement of a field having a computed value derived from other properties: X/Y coordinates, area, perimeter, or length to name a few. Before Postgres 12, this could be implemented using triggers:

ALTER TABLE fields ADD COLUMN area NUMERIC;
CREATE FUNCTION calc_area() RETURNS TRIGGER AS $$
BEGIN
NEW.area = ST_Area(NEW.geom);
RETURN NEW;
END;

$$
language plpgsql;
CREATE TRIGGER calc_area_trigger BEFORE INSERT OR UPDATE ON fields FOR EACH ROW EXECUTE PROCEDURE calc_area();

For 12 and newer, there exists another possibility: generate the value using GENERATED syntax:

ALTER TABLE fields ALTER COLUMN area NUMERIC GENERATED ALWAYS AS (ST_Area(geom)) STORED;

This means same “bang” (functionality) for less “buck” (code). No need to create a function and a trigger, just an expression and you are done. While QGIS currently doesn’t automatically block input fields in attribute form for such columns, any user-supplied value will be overwritten on session end (same as with triggers).

Of course, triggers aren’t going anywhere: for more complicated cases they will be still necessary. But for simple area/length/coordinate calculation, GENERATED columns offer a simpler, declarative approach.

So if you have a PostGIS instance based on Postgres 12, use QGIS, and haven’t upgraded your QGIS yet – this is one more reason to do so!

Uncategorized

PostGIS vs earthdistance for spatial queries

Post author By elephanttamer
Post date June 15, 2020

So ya thought ya might not need PostGIS for location-based queries in PostgreSQL…

As a PostGIS afficionado, I became triggered by this post. But after a while, I decided to give earthdistance a try – it’s just a piece of technology, after all. And sticking fanatically to one or another isn’t the best way to go.

Round one: setup

For test purposes, I imported a full Geonames dump (the detailed setup is described in a previous post).

The bootstraping required for earthdistance consists of creating an extension:

CREATE EXTENSION earthdistance CASCADE;

(the CASCADE modifier is needed for “cube” extension, which is a required dependency)

The earthdistance extension can work out of the box with regular numeric “latitude” and “longitude” columns, but it has two problems: the distance units are hardcoded to statute miles, and there is no index support. Using miles in continental Europe and sequential scan on a table with 12 million rows are both bad ideas. Fortunately, there is an option of using index and meters: cube-based earthdistance functions, that use a special “earth” type. So I created an index using GiST method and ll_to_earth function:

CREATE INDEX ON geoname USING GIST(ll_to_earth(latitude, longitude));

Creating this index took 11 minutes and 57 seconds on a laptop.

For PostGIS, the CREATE EXTENSION step is also required:

CREATE EXTENSION postgis;

(no CASCADE here). After that, we can create a geography column (yes, not geometry – our test dataset has a global coverage and lon-lat coordinates, a perfect use case for geography and bad for geometry):

ALTER TABLE geoname ADD COLUMN geom GEOGRAPHY;
UPDATE geoname SET geom = ST_MakePoint(longitude, latitude)::geography;

-- for Postgres 12 and newer, if you don't plan to edit location data in specialized GIS software, you can also use a generated column instead:

ALTER TABLE geoname ADD COLUMN geom GEOGRAPHY GENERATED ALWAYS AS (ST_MakePoint(longitude, latitude)::geography) STORED;

This took 7 mins 47 secs. If you are into GIS, you may notice that I skipped all the coordinate systems-related black magic. That’s because PostGIS assumes coordinates in geography type to be WGS-84, aka “GPS coordinates”, and specyfying a coordinate system is not required. The second step is to create an index:

CREATE INDEX ON geoname USING GIST(geom);

which took 2 minutes and 20 seconds.

Despite the fact that the PostGIS setup is a bit more complicated (new column and index instead of a functional index on existing columns), PostGIS is actually a bit faster here.

Round two: querying

After the necessary bootstrapping is done, it’s time to do some spatial queries like “find 10 nearest pubs in 10 kilometer radius”. As GeoNames dataset doesn’t include pubs (you’ll need OpenStreetMap data for that), I queried for nearest mountain peaks located within 10 kilometers from Zakopane railway station.

For PostGIS, such query is done using ST_DWithin (for filtering) and ST_Distance (for precise distance calculation and sorting) functions:

SELECT name, elevation, ST_Distance(geom, ST_MakePoint(19.95,49.3)) AS dist FROM geoname WHERE fclass='T' AND elevation !='' AND ST_DWithin(ST_MakePoint(19.95,49.3), geom, 10000) ORDER BY dist ASC;

The query was executed in 3 ms.

Next, the earthdistance way:

SELECT name, elevation, earth_distance(ll_to_earth(latitude, longitude), ll_to_earth(49.3, 19.95)) AS dist FROM geoname WHERE fclass='T' AND elevation !='' AND earth_box(ll_to_earth(49.3, 19.95), 10000) @> ll_to_earth(latitude, longitude) AND earth_distance(ll_to_earth(latitude, longitude), ll_to_earth(49.3, 19.95)) < 10000 ORDER BY dist ASC;

This one was executed in 4.5 ms.

Note that the earth_box function alone is not sufficient. The WHERE clause must be refined with additional earth_distance -based filter, and the docs explictly say so:

ome points in this box are further than the specified great circle distance from the location, so a second check using earth_distance should be included in the query.

Round three: precision

The earthdistance extension assumes a spherical Earth, while PostGIS geography uses a WGS84 ellipsoid as the Earth model (same as used by GPS system). The distance calculations in PostGIS are thus more precise. But what’s the difference?

SELECT name, elevation, earth_distance(ll_to_earth(latitude, longitude), ll_to_earth(49.3, 19.95)),
ST_Distance(geom, ST_MakePoint(19.95, 49.3))
FROM geoname
WHERE fclass='T' AND elevation != '' AND
ST_DWithin(geom, ST_MakePoint(19.95, 49.3), 10000)
ORDER BY st_distance ASC LIMIT 3;

name | elevation | earth_distance | st_distance
------------------------------------------------
Gubałówka | 1126 | 1813.123 | 1816.16
Samkowa Czuba | 1189 | 3528.23 | 3527.26
Krokiew | 1378 | 3770.05 | 3767.39

At 50 degrees of latitude, the earthdistance’s results are about 1-3 meters off.

Conclusion

For very basic spatial queries, both Postgres extensions offer very similar performance, with PostGIS being slightly faster in some cases and offering better precision. The accuracy provided by earthdistance lies within the error margin of an average GPS receiver (but the hardware is continously getting better, and the sphere-based calculations aren’t!). My recommendation is: if you can use the PostGIS extension, do so. The basic setup is not that complicated (you don’t have to know anything about SRIDs and projections when using the geography type!), and you get better precision and slightly better performance for free. Also, your design will be future-proof: if any more sophisticated spatial features like geofencing, measuring distance along a line, or matching a GPS track to street network are necessary, PostGIS will be there to help. For constrained environments, where installing another software package is either impossible or requiring a mountain of paperwork (like in a bank), the earthdistance remains a viable option.

Uncategorized

GiST vs GIN index for LIKE searches – a comparison

Post author By elephanttamer
Post date June 13, 2020

A LIKE operator is a very important tool in text searches in any SQL database. Unfortunately, with default configuration it triggers a sequential scan of the whole table, and for bigger datasets its performance is quite bad.

For some use cases, when the wildcard (percent sign) is always at the end of a search term, a simple index modifier – varchar_pattern_ops – added to a standard BTree index will be sufficient to speed up LIKE queries. But when the requirement is more complicated, with BTree you are out of luck.

This is where the pg_trgm extension kicks in. With it enabled, you are able to create a GiST or GIN index with respectively gist_trgm_ops or gin_trgm_ops modifier, and query textual columns more efficiently. But which index type is better? Let’s try it out.

I have downloaded and imported the GeoNames full dump. It consists of 12 million rows, so it’s big enough to be a non-trivial dataset. Then I made two separate tables, geoname_gist and geoname_gin, and filled them with data (the code below is for psql, using a \copy command):

CREATE TABLE geoname_gist(geonameid integer primary key, name varchar, asciiname varchar, alternatenames varchar, latitude float, longitude float, featureclass varchar, featurecode varchar, countrycode2 varchar, cc3 varchar, admin1code varchar, admin2code varchar, admin3code varchar, admin4code varchar, population integer, elevation varchar dem varchar, timezone varchar, modified_at date);
CREATE TABLE geoname_gin (LIKE geoname_gist);
\copy geoname_gist from '/home/me/Downloads/allCountries.txt'
\copy geoname_gin from '/home/me/Downloads/allCountries.txt'

After that, I created indexes:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON geoname_gist USING gist(name gist_trgm_ops);
CREATE INDEX ON geoname_gin USING gin(name gin_trgm_ops);

The GiST index took 12 minutes 10 seconds to build on a i7 laptop, while the GIN was ready in 2 minutes 50 secs. And what about the disk usage? Let the results speak for themselves:

SELECT pg_size_pretty(pg_relation_size('geoname_gist_name_idx'));

pg_size_pretty
---------------
1940 MB
(1 row)

SELECT pg_size_pretty(pg_relation_size('geoname_gin_name_idx'));

pg_size_pretty
---------------
323 MB
(1 row)

And then, I checked the performance of both indexes using some common name suffixes in Central Europe:

EXPLAIN ANALYZE SELECT name FROM geoname_gist WHERE name LIKE '%wice';
EXPLAIN ANALYZE SELECT name FROM geoname_gist WHERE name LIKE '%ovo';
EXPLAIN ANALYZE SELECT name FROM geoname_gist WHERE name LIKE '%burg';

EXPLAIN ANALYZE SELECT name FROM geoname_gin WHERE name LIKE '%wice';
EXPLAIN ANALYZE SELECT name FROM geoname_gin WHERE name LIKE '%ovo';
EXPLAIN ANALYZE SELECT name FROM geoname_gin WHERE name LIKE '%burg';

Same queries, same dataset, very similar execution plans but wildly different results: GiST index usage resulted in 1-3 seconds execution time, while with GIN index it took only 50-100 miliseconds to complete.

Takeaway: always use GIN for trigram indexing, and if your database suffers from poor LIKE performance, check not only the scan type, but also the indexing method.

Uncategorized

Hello world!

Welcome to my new blog. I’ll use it for sharing knowledge about various aspects of running PostgreSQL databases at scale and in production (and some experiments, too).