Microservices Data Synchronization Using PostgreSQL, Debezium, and NATS
A step-by-step guide to building a data synchronization stack with popular technologies. You'll get a short intro to CDC, the challenges involved, and ready-to-copy code.
Microservices architecture has emerged as a popular approach due to its scalability and flexibility. However, this architectural style brings challenges, especially regarding data synchronization across different services. By combining PostgreSQL, Debezium, and NATS, we can create an efficient and reliable solution for synchronizing data across microservices.
This article will briefly introduce the data synchronization challenges in microservices and then show how to set up PostgreSQL, Debezium, and NATS to solve them.
The Challenge of Data Synchronization in Microservices
Microservices are designed to be loosely coupled and independently deployable, with each microservice having its own database. While this independence is beneficial, it poses a significant challenge in maintaining data consistency across the system. Traditional batch processing methods such as ETL (extract, transform, load) can be cumbersome and fail to offer the real-time updates that are crucial for many modern applications.
Short Intro to Change Data Capture (CDC)
Change data capture (CDC) is the process of observing all data changes written to a database and extracting them in a form that can be replicated to other systems, so that those systems hold an accurate copy of the data.
For a deeper discussion about this, please refer to our previous article (Understanding Database Synchronization: An Overview of Change Data Capture).
The Components
This article aims to show how a database can be synced to other systems. I am using PostgreSQL as the source database, but you can apply this to other databases like MySQL or MariaDB. Below are short descriptions of each component I am going to use.
PostgreSQL: A Robust Database Solution
PostgreSQL, a powerful open-source object-relational database system, is widely used due to its advanced features and reliability. In our microservices architecture, each service utilizes its own instance of PostgreSQL, ensuring data isolation and integrity.
Debezium: Change Data Capture
Debezium is an open-source distributed platform for change data capture (CDC). It monitors databases and captures row-level changes, which are then emitted as event streams. Integrating Debezium with PostgreSQL allows us to capture every change made to the database in real time.
NATS: The Messaging Backbone
NATS acts as the central messaging system. NATS is known for its lightweight and simple design, which allows for high throughput and low latency in message delivery. NATS is the conduit through which data changes are communicated across different microservices.
Setting Up PostgreSQL
For CDC to work with PostgreSQL, we first need to understand a few key PostgreSQL concepts that CDC relies on.
Database Write-Ahead Log (WAL). The Write-Ahead Log ensures data integrity and reliability: every change to the database is written to a log before it is applied to the actual database files. This technique is crucial for maintaining the atomicity and durability properties of transactions.
Replication Slot. In PostgreSQL, replication slots play a critical role in the streaming replication process. They ensure that the server retains the WAL segments that replicas (or other consumers such as Debezium) still need, even while those consumers are disconnected. There are two types of replication slots: physical replication slots and logical replication slots.
With this knowledge, we need to update the settings of our PostgreSQL database. The most important thing is to set wal_level to logical. Additionally, we can adjust max_wal_senders and max_replication_slots if necessary. So here is our docker-compose file.
version: '3.9'
services:
  postgres:
    image: postgres:latest
    command: "-c wal_level=logical -c max_wal_senders=5 -c max_replication_slots=5"
    environment:
      POSTGRES_DB: glassflowdb
      POSTGRES_USER: glassflowuser
      POSTGRES_PASSWORD: glassflow
    ports:
      - "5432:5432"
    volumes:
      - ./data/postgres:/var/lib/postgresql/data
We can now run $ docker-compose up to start the database. For this article, let's create a simple table whose changes we will track later on.
$ psql -h 127.0.0.1 -U glassflowuser -d glassflowdb
Password for user glassflowuser:
psql (14.10, server 16.1 (Debian 16.1-1.pgdg120+1))
WARNING: psql major version 14, server major version 16.
Some psql features might not work.
Type "help" for help.
glassflowdb=# CREATE TABLE accounts (
user_id serial PRIMARY KEY,
username VARCHAR ( 50 ) UNIQUE NOT NULL,
password VARCHAR ( 50 ) NOT NULL,
email VARCHAR ( 255 ) UNIQUE NOT NULL,
created_on TIMESTAMP NOT NULL,
last_login TIMESTAMP
);
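While connected, it is worth verifying that the WAL settings we passed via docker-compose took effect, and taking a look at the replication slots. The slot list will be empty for now; Debezium creates its logical slot once it starts.

glassflowdb=# SHOW wal_level;
glassflowdb=# SHOW max_wal_senders;
glassflowdb=# SELECT slot_name, plugin, slot_type, active FROM pg_replication_slots;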
Setting up NATS
The only NATS-specific requirement is enabling JetStream with the --js flag, since the Debezium sink we configure below publishes change events to a JetStream stream. We add the following service to our docker-compose file.
  nats:
    image: nats:latest
    ports:
      - "4222:4222"
    command:
      - "--debug"
      - "--http_port=8222"
      - "--js"
Setting Up Debezium
We are going to use a ready-to-run distribution of Debezium: the Debezium Server.
  debezium:
    image: docker.io/debezium/server:latest
    volumes:
      - ./debezium/conf:/debezium/conf
    depends_on:
      - postgres
      - nats
To make it work, we still need to define a configuration for Debezium. The configuration file is named application.properties and goes into the ./debezium/conf directory we mounted above.
debezium.source.connector.class=io.debezium.connector.postgresql.PostgresConnector
debezium.source.offset.storage.file.filename=data/offsets.dat
debezium.source.offset.flush.interval.ms=0
debezium.source.database.hostname=postgres
debezium.source.database.port=5432
debezium.source.database.user=glassflowuser
debezium.source.database.password=glassflow
debezium.source.database.dbname=glassflowdb
debezium.source.topic.prefix=glassflowtopic
debezium.source.plugin.name=pgoutput
debezium.sink.type=nats-jetstream
debezium.sink.nats-jetstream.url=nats://nats:4222
debezium.sink.nats-jetstream.create-stream=true
debezium.sink.nats-jetstream.subjects=postgres.*.*
It is important to note that, by default, the Debezium PostgreSQL connector expects the decoderbufs logical decoding plugin. We don't use decoderbufs in this article; instead we set debezium.source.plugin.name=pgoutput, which relies on the pgoutput plugin that ships with PostgreSQL 10 and later, so nothing extra has to be installed in the database.
How Debezium Achieves CDC with PostgreSQL
Before we move on, let's talk about how Debezium achieves CDC with PostgreSQL.
Debezium connects to PostgreSQL as a replication client. This is facilitated by setting up a Debezium connector for PostgreSQL, which requires PostgreSQL to be configured with wal_level set to logical.
Upon startup, Debezium creates a logical replication slot in PostgreSQL. This slot ensures that WAL entries relevant to Debezium are retained until processed, preventing data loss even if the Debezium connector temporarily goes offline.
Debezium reads the changes from the WAL via the replication slot. It decodes these changes from their binary format to a structured format (for example, JSON) representing the SQL operations.
Each decoded change is then emitted as a separate event. These events include all the information necessary to understand what change occurred in the database, such as the type of operation (INSERT, UPDATE, DELETE), the affected table, and the old and new values of the modified rows.
Debezium acts as a NATS producer: each change event is published to a NATS subject (usually one subject per table).
Consumers can subscribe to these subjects to receive real-time updates about database changes. This enables applications and microservices to react to data changes as they occur, as the sketch below shows.
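As an illustration, here is a minimal sketch of such a consumer written in Go with the nats.go client. It assumes the stream and subject created by the configuration above; the subject name glassflowtopic.public.accounts is derived from the topic prefix, schema, and table, so check nats stream info if your setup exposes different subjects.

package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// Connect to the NATS server from our docker-compose setup.
	nc, err := nats.Connect("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	// Obtain a JetStream context so we can read from the stream Debezium writes to.
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Subscribe to change events for the accounts table.
	// Assumption: the subject follows <topic.prefix>.<schema>.<table>.
	_, err = js.Subscribe("glassflowtopic.public.accounts", func(msg *nats.Msg) {
		fmt.Printf("change event: %s\n", string(msg.Data))
		msg.Ack()
	})
	if err != nil {
		log.Fatal(err)
	}

	// Block forever so the subscription keeps receiving events.
	select {}
}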
Testing Our Setup
If everything is set up correctly, any change we make to the accounts table in PostgreSQL will be sent to NATS.
$ psql -h 127.0.0.1 -U glassflowuser -d glassflowdb
glassflowdb=# INSERT INTO "public"."accounts" ("username", "password", "email", "created_on")
VALUES ('user2', 'beseeingya', 'user2@email.com', NOW());
glassflowdb=# DELETE FROM accounts WHERE username = 'user3';
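Before reading the events, we can optionally confirm that Debezium connected to NATS and created its JetStream stream (named DebeziumStream, the same name used by the consumer commands below):

$ nats stream ls
$ nats stream info DebeziumStream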
When we create a consumer, we can see all the events sent by Debezium. Each event carries the row state before and after the change, plus an op field describing the operation ("c" for create, "u" for update, "d" for delete).
$ nats consumer add DebeziumStream viewer --ephemeral --pull --defaults > /dev/null
$ nats consumer next --raw --count 100 DebeziumStream viewer | jq -r '.payload'
{
  "before": null,
  "after": {
    "user_id": 4,
    "username": "user2",
    "password": "beseeingya",
    "email": "user2@email.com",
    "created_on": 1700505308855573,
    "last_login": null
  },
  "source": {
    "version": "2.2.0.Alpha3",
    "connector": "postgresql",
    "name": "glassflowtopic",
    "ts_ms": 1700505308860,
    "snapshot": "false",
    "db": "glassflowdb",
    "sequence": "[\"26589096\",\"26597648\"]",
    "schema": "public",
    "table": "accounts",
    "txId": 742,
    "lsn": 26597648,
    "xmin": null
  },
  "op": "c",
  "ts_ms": 1700505309220,
  "transaction": null
}
{
  "before": {
    "user_id": 3,
    "username": "",
    "password": "",
    "email": "",
    "created_on": 0,
    "last_login": null
  },
  "after": null,
  "source": {
    "version": "2.2.0.Alpha3",
    "connector": "postgresql",
    "name": "glassflowtopic",
    "ts_ms": 1700505331733,
    "snapshot": "false",
    "db": "glassflowdb",
    "sequence": "[\"26598656\",\"26598712\"]",
    "schema": "public",
    "table": "accounts",
    "txId": 743,
    "lsn": 26598712,
    "xmin": null
  },
  "op": "d",
  "ts_ms": 1700505331751,
  "transaction": null
}
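If a service needs to act on these events rather than just print them, it can decode the payload into a struct. Below is a minimal sketch; the field names are taken from the output above, and the Account struct is a hypothetical mirror of our accounts table.

package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// Account mirrors (part of) the columns of the accounts table.
type Account struct {
	UserID   int    `json:"user_id"`
	Username string `json:"username"`
	Email    string `json:"email"`
}

// ChangeEvent covers the Debezium fields used in the output above.
type ChangeEvent struct {
	Before *Account `json:"before"`
	After  *Account `json:"after"`
	Op     string   `json:"op"` // "c" = create, "u" = update, "d" = delete
	TsMs   int64    `json:"ts_ms"`
}

func main() {
	// A trimmed-down payload like the one printed by the nats CLI above.
	payload := []byte(`{"before": null, "after": {"user_id": 4, "username": "user2", "email": "user2@email.com"}, "op": "c", "ts_ms": 1700505309220}`)

	var event ChangeEvent
	if err := json.Unmarshal(payload, &event); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("op=%s after=%+v\n", event.Op, event.After)
}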
Why It Matters
Integrating Debezium with PostgreSQL and NATS for change data capture (CDC) is a fundamental building block in constructing advanced, real-time data pipelines. Once you have set this up, it opens up a plethora of possibilities for further data utilization and integration across various systems and applications.

For example, the change events captured from the database can be seamlessly streamed to a data lake, enabling organizations to aggregate vast amounts of data in a centralized repository for complex analysis and machine learning purposes. These data streams can also be directly fed into analytics dashboards, providing real-time insights and decision-making capabilities. This can be particularly useful for monitoring key metrics, detecting anomalies, or understanding user behavior in near real-time. Furthermore, the system can be extended to trigger automated workflows in response to specific data changes, such as sending notifications or updating other systems.

The flexibility and scalability of this setup make it an ideal foundation for building comprehensive and responsive data-driven ecosystems, catering to a wide range of use cases from business intelligence to operational monitoring.
Conclusions
Synchronizing data across microservices can be a complex task, but the combination of PostgreSQL, Debezium, and NATS offers a robust solution. This setup keeps data consistent across services in near real time while preserving the principles of microservices architecture. By leveraging these technologies, we can build scalable, resilient, and efficient systems that meet the demands of modern application development.
Remember, this guide is a starting point. Depending on your specific requirements, further customization and configuration might be necessary.