Edit: In the original post, I made a mistake about Debezium versions: it should be 1.2.2 to 1.3.0, not 2.0 to 3.0. And we have experienced a production incident with 1.2.1 so my recommendation is – go for 1.3.1, not any other version (as for November 2020).
Debezium is a popular Open Source change data capture (CDC) solution. It’s based on a set of well-established components (Kafka, Kafka Connect, ZooKeeper) and used successfully in production by companies like Dehlivery or BlaBlaCar. However, the setup is not as straightforward as one may think, and you may encounter some obstacles. In this post, I will share some tips originating from hooking up Debezium to a quite big and busy OLTP Postgres database.
Originally, Debezium was connected to a Postgres DB using a wal2json or decoderbufs plugins. Don’t use them – as native logical replication became available in Postgres 10, a better option called “pgoutput” was introduced in Debezium. From the database perspective, the Debezium connector will be treated just like another logical replication subscriber, and usual rules of logical replication will apply.
Check wal_level
Logical decoding – on which logical replication and Debezium depend on – requires wal_level configuration parameter to be set to “logical”, and its change requires Postgres restart. Check this beforehand and schedule a database restart if necessary.
Pick right Debezium version
NEVER, EVER use Debezium versions between 1.2.2 and 1.3.0. Those versions come with a bug which is hard to understand for anoyne who isn’t a Postgres expert, but its consequences are simple and dangerous : using such buggy Debezium version will take the master database down sooner or later. This is because Debezium will consume data changes, but won’t confirm the consumption to the database server. The server will retain all WAL segments since replication slot creation, and you will eventually run out of disk space (or money, if you use some pay-as-you-go storage like Amazon EFS or keep adding new disks).
Check database logs
When Debezium fails to connect, it returns a “Could not obtain encoding for database …” error message. Usually it has NOTHING to do with encoding, it just means that it couldn’t connect. Check the Postgres log for details – it might be a wrong password or no pg_hba.conf rule. If the database log is clean, look for network issues.
Create slot and publication manually
While Debezium is capable of creating required database entities (logical replication slot and publication) itself, it’s a bad idea. First – it requires superuser privileges for Debezium user. Second – it will create publication for all tables, what can have unpleasant side effects – for example it will silently block all write queries to tables that don’t have a primary key. Third – you may run into an “publication does not exist” error even if the publication is indeed created correctly, see here.
My advice is – create a publication and a replication slot manually before setting up a connector. Beware though – creation order matters! You have to create publication before you create replication slot, otherwise you will see “publication does not exist” error. The correct order is:
CREATE PUBLICATION debezium FOR TABLE <your table list here>;
SELECT pg_create_logical_replication_slot('debezium', 'pgoutput');
You are free to change the table list after that at any time using ALTER PUBLICATION.
On the Debezium side, set:
"publication.name":"debezium",
"slot.name":"debezium"
in your JSON config. Of course the “debezium” name isn’t mandatory, you can be more specific – this is just an example.
Also, especially for high-volume databases, it’s recommended to use the Avro format for serialization:
"key.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter":"io.confluent.connect.avro.AvroConverter"
Monitor closely
As with any replication slot, if the subscriber doesn’t consume changes, WAL segments will start to pile up. You have to monitor if the Debezium is up, if the replication lag isn’t too big, and if the database server has some free disk space. Most of the time, if Debezium goes down, a simple restart will suffice, but you have to know when to do it. Debezium is much more crash-prone than Postgres, so a monitoring solution is a must-have.
If you look into pg_stat_replication system view in Postgres, here are some differences between Debezium and regular logical replication:
- The replay lag (or difference between pg_current_wal_lsn and replay_lsn) can become quite big and drop suddenly in one minute intervals. This is because Debezium commits change consumption every minute.
- The reply_time in pg_stat_replication is in the past – in my case it was 1999-12-21 03:26:12 – and it appears to be weird, but normal.