
Leveraging Change Data Capture For Database Migrations At Scale

PostgreSQL is awesome, and here at Squarespace, we love it for the stable foundation it provides for ACID (Atomicity, Consistency, Isolation, and Durability) compliant, business-critical workloads. However, it doesn’t scale horizontally – at least not completely – and for several of our use cases, we had already pushed vertical scaling to its limits. At the same time, as our systems and ambitions grew, scheduling downtime for maintenance – or preparing for disaster recovery and other unplanned outages – started to impose a tax in the form of developer toil and business impact, both of which had become increasingly unacceptable.

So we started exploring distributed databases, and CockroachDB (CRDB) stood out. It speaks Postgres (mostly), seamlessly scales both reads and writes horizontally across low-cost machines, and promises to keep our data alive and healthy regardless of which node decides to call it a day. Most importantly, its compatibility with our existing architecture gave us the confidence we needed to adopt it.

Perfect! Now we just had to move a whole lot of databases over to use it. No big deal, right?

The High-Level Plan

These databases supported a wide range of applications, each with their own usage patterns and requirements. Some were write-heavy, many were read-heavy, and quite a few were highly sensitive to read and/or write latencies. Ownership of these databases was spread across various platform and product teams, each with differing levels of urgency around adopting a distributed database. In short, a very typical scenario for a thriving, large-scale organization.

It became clear that the onus of migration had to be with our Databases team, with little to no lift on our application teams. A high-level plan evolved. We, Databases, would:

  1. Evaluate the general fit and compatibility of CRDB for each team’s needs and patterns

  2. Migrate their schemas, users, and data for each database

  3. Validate that everything is in order

  4. Cut over traffic

  5. Pray (optional!)

We determined that the schema and user migration would require an in-house tool to address the nuances of CRDB, like column type incompatibilities, as we traversed the databases. We started with the most obvious path: a flavor of pg_dump and pg_restore, PostgreSQL’s native migration tools. But it quickly became clear this wouldn’t cut it. Those tools are fine for one-off exports, but they can’t handle ongoing changes, and a hiccup mid-process could leave us in a worse place than where we started. We had a better experience with MOLT Fetch, CockroachDB’s continuous replication tool, but it didn’t support all of our requirements at the time. The goal was to get everything synced up while Postgres stayed live, and only redirect traffic once we were 100% confident.

So, we didn’t just need a migration tool – we needed a safety net. If anything went wrong with CRDB, we had to be able to roll back fast, without service disruption.

Streaming All The Things

We leaned into Postgres’ change data capture (CDC) capabilities, opting to replicate data logically in near real time.

This meant starting with its write-ahead log (WAL). Postgres always records changes to the WAL as binary entries before applying them to the data files on disk. We used pgoutput, Postgres’ built-in logical decoding plugin, to bridge the WAL with the logical streaming protocol and efficiently stream database changes (inserts, updates, and deletes). From there, we used Debezium – an open-source CDC platform that plugs into Kafka Connect – to watch each table and funnel changes into Kafka topics in Avro format.

The setup was neat: one Debezium connector per table, each starting with a snapshot – a lock-safe transaction for reading from the table – and continuing from the exact log sequence number (LSN) where it left off. The database served production traffic the whole time. No write downtime. We saw snapshot speeds hitting 60k rows per second, though mileage varied depending on the database’s shape and load.
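
To make this concrete, here is a rough sketch of registering one such per-table connector through the Kafka Connect REST API. The hostnames, credentials, slot, and table names are placeholders rather than our actual configuration, and the exact property names vary a bit between Debezium versions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
  public static void main(String[] args) throws Exception {
    // One Debezium connector per table: pgoutput decodes the WAL, the initial
    // snapshot runs in a lock-safe transaction, and streaming resumes from the
    // connector's last committed LSN after any restart.
    String config = """
        {
          "name": "accounts-connector",
          "config": {
            "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
            "plugin.name": "pgoutput",
            "database.hostname": "pg-primary.internal",
            "database.port": "5432",
            "database.user": "cdc_user",
            "database.password": "changeme",
            "database.dbname": "appdb",
            "topic.prefix": "appdb",
            "table.include.list": "public.accounts",
            "slot.name": "debezium_accounts",
            "snapshot.mode": "initial",
            "key.converter": "io.confluent.connect.avro.AvroConverter",
            "value.converter": "io.confluent.connect.avro.AvroConverter",
            "key.converter.schema.registry.url": "http://schema-registry:8081",
            "value.converter.schema.registry.url": "http://schema-registry:8081"
          }
        }""";

    // Register the connector with the Kafka Connect REST API.
    HttpRequest request = HttpRequest.newBuilder(URI.create("http://kafka-connect:8083/connectors"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(config))
        .build();
    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```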

Figure 1 – The layers involved in capturing changes from Postgres

The data, once in Kafka, needed to be written to CockroachDB efficiently. We wrote a custom consumer using Apache Beam to process the events. Inserts and updates became upserts, deletes were passed through untouched, and all writes were batched whenever possible for performance. We built subtle data transformations directly into this step when necessary for compatibility.
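
As an illustration (not our production code), that routing can key off the op field in Debezium’s event envelope. The class below is hypothetical and assumes the Avro values have already been decoded into GenericRecords.

```java
import org.apache.avro.generic.GenericRecord;

/** Classifies a Debezium change event into the write issued against CockroachDB. */
final class ChangeRouter {

  enum WriteKind { UPSERT, DELETE }

  static WriteKind classify(GenericRecord envelope) {
    // Debezium envelopes carry an "op" field: "r" (snapshot read), "c" (create),
    // "u" (update), "d" (delete). Snapshot reads, inserts, and updates all become
    // idempotent upserts; deletes pass through as deletes.
    String op = envelope.get("op").toString();
    switch (op) {
      case "r":
      case "c":
      case "u":
        return WriteKind.UPSERT;
      case "d":
        return WriteKind.DELETE;
      default:
        throw new IllegalArgumentException("Unexpected op: " + op);
    }
  }

  /** Row image to write: "after" for upserts, "before" (the deleted row) for deletes. */
  static GenericRecord rowImage(GenericRecord envelope, WriteKind kind) {
    return (GenericRecord) envelope.get(kind == WriteKind.UPSERT ? "after" : "before");
  }
}
```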

Each table had its own Apache Beam pipeline, pulling from its Kafka topic and pushing into CRDB. Data types were inferred automatically by reading the target schema from CRDB and pairing it with the structured Avro messages. Thanks to Debezium’s expressive event encoding, decoding data types for CRDB was painless.
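
A stripped-down sketch of one such per-table pipeline is below. It assumes a toy accounts table with id and email columns, reads Debezium’s Avro events via Confluent’s schema registry, and writes batched upserts through Beam’s JdbcIO; because CRDB speaks the Postgres wire protocol, the stock Postgres JDBC driver works. Connection strings, topic and subject names are placeholders, and the delete branch is omitted for brevity.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.io.kafka.ConfluentSchemaRegistryDeserializerProvider;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class AccountsReplicationPipeline {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    pipeline
        // One pipeline per table, one Kafka topic per table.
        .apply("ReadChanges", KafkaIO.<byte[], GenericRecord>read()
            .withBootstrapServers("kafka:9092")
            .withTopic("appdb.public.accounts")
            .withKeyDeserializer(ByteArrayDeserializer.class)
            .withValueDeserializer(ConfluentSchemaRegistryDeserializerProvider.of(
                "http://schema-registry:8081", "appdb.public.accounts-value"))
            .withoutMetadata())
        // Keep inserts, updates, and snapshot reads; deletes would feed a parallel
        // branch that issues DELETE statements instead.
        .apply("DropDeletes", Filter.by(kv -> kv.getValue().get("after") != null))
        // Pull the new row image out of the Debezium envelope.
        .apply("ExtractRow", MapElements
            .into(TypeDescriptors.kvs(TypeDescriptors.longs(), TypeDescriptors.strings()))
            .via(kv -> {
              GenericRecord after = (GenericRecord) kv.getValue().get("after");
              return KV.of((Long) after.get("id"), after.get("email").toString());
            }))
        // UPSERT keeps each write idempotent, so replays from Kafka are harmless.
        .apply("WriteToCrdb", JdbcIO.<KV<Long, String>>write()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "org.postgresql.Driver",
                "jdbc:postgresql://crdb-lb.internal:26257/appdb?sslmode=require"))
            .withStatement("UPSERT INTO accounts (id, email) VALUES (?, ?)")
            .withPreparedStatementSetter((row, stmt) -> {
              stmt.setLong(1, row.getKey());
              stmt.setString(2, row.getValue());
            })
            .withBatchSize(500));

    pipeline.run();
  }
}
```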

The pipelines were transactional, checkpointed, and backpressured by design – meaning no event moved forward until we were sure it had landed safely in CockroachDB.

We deployed all of this on Google Cloud’s Dataflow Runner. Its observability and reliability made it a perfect fit for our streaming workloads. With Terraform-managed Flex templates and custom metrics to track freshness, we had visibility and control like never before.
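
One way to express such a freshness metric is a DoFn that compares Debezium’s source commit timestamp against the wall clock and reports the lag as a Beam Distribution, which Dataflow surfaces alongside its built-in metrics. The names here are illustrative rather than our exact implementation.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

/** Records how far each applied change lags behind its commit on the source database. */
class FreshnessFn extends DoFn<KV<byte[], GenericRecord>, KV<byte[], GenericRecord>> {
  // Shows up as a custom metric in Dataflow's monitoring UI.
  private final Distribution lagMillis = Metrics.distribution("replication", "source_lag_millis");

  @ProcessElement
  public void processElement(ProcessContext c) {
    GenericRecord source = (GenericRecord) c.element().getValue().get("source");
    long sourceTsMillis = (Long) source.get("ts_ms");  // Debezium's source commit timestamp
    lagMillis.update(System.currentTimeMillis() - sourceTsMillis);
    c.output(c.element());
  }
}
```

It would be attached with ParDo.of(new FreshnessFn()) just before the write stage, so the reported lag reflects events that are about to land in CRDB.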

Cutover and Safety Nets

But, we did mention needing a safety net! To get one, we put together a fallback pipeline that mirrored the process back to Postgres. A custom changefeed consumer captured the events from CRDB and held them in a separate Kafka queue. These could then be easily written back to the Postgres database in a worst-case scenario.
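
CockroachDB changefeeds are created with SQL, so standing up the fallback feed for a single table can be as small as the sketch below; the connection details, topic prefix, and resolved interval are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateFallbackChangefeed {
  public static void main(String[] args) throws Exception {
    // CockroachDB speaks the Postgres wire protocol, so the stock JDBC driver works.
    try (Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://crdb-lb.internal:26257/appdb?sslmode=require",
            "migration_user", "changeme");
         Statement stmt = conn.createStatement()) {
      // Emit every change on the table to a dedicated Kafka topic; "resolved"
      // timestamps tell the consumer how far the feed has progressed.
      stmt.execute(
          "CREATE CHANGEFEED FOR TABLE accounts "
              + "INTO 'kafka://kafka:9092?topic_prefix=fallback_' "
              + "WITH updated, resolved = '10s'");
    }
  }
}
```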

Figure 2 – The entire pipeline, complete with a fallback mechanism

Once we were confident the data was synced and validated – thanks to a homegrown data diffing tool – we flipped the switch. Literally. Our proxy server was redirected to CRDB, and our apps didn’t even blink.
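
The diffing tool itself is beyond the scope of this post, but its simplest building block looks something like the sketch below: run the same aggregate on both sides and compare. Connection details and table names are placeholders, and counts are only a first pass; a thorough diff also needs row- or checksum-level comparisons, since matching counts alone don’t prove the contents agree.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class RowCountCheck {
  static long count(String jdbcUrl, String user, String password, String table) throws Exception {
    try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT count(*) FROM " + table)) {
      rs.next();
      return rs.getLong(1);
    }
  }

  public static void main(String[] args) throws Exception {
    long pg = count("jdbc:postgresql://pg-primary.internal:5432/appdb",
        "verifier", "changeme", "accounts");
    long crdb = count("jdbc:postgresql://crdb-lb.internal:26257/appdb?sslmode=require",
        "verifier", "changeme", "accounts");
    // Any drift means the pipeline has not fully caught up (or something is wrong).
    System.out.printf("postgres=%d cockroach=%d drift=%d%n", pg, crdb, Math.abs(pg - crdb));
  }
}
```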

For most systems, the data was >99.99% fresh by the time of their cutover. That level of precision meant we could migrate these systems without any noticeable write downtime. But, some edge cases required coordinated write or read pauses – especially when applications relied on immediate consistency and couldn’t tolerate even the smallest stale read. Other times, high-write-volume systems would require a brief write freeze just to let the pipeline catch up. In these cases, human intervention made these migrations go smoothly, and overplanning with extra automation would have extended our project delivery time, perhaps unnecessarily.

Referential Integrity and Other Lessons

One challenge we hit hard was referential integrity. Our system had one writer per table, and these writers operated independently. If a table referenced another through a foreign key and the referenced table’s writer had not yet caught up, the write to the database would fail.

We needed a generic solution here. Our first strategy was to retry persistently and see if the write would eventually stick. This worked very well at low volumes, but write throughput took a hit when the whole snapshot of a table was being migrated. The approach also failed outright when two or more tables had cyclical referential constraints. Surprising? Maybe, but Postgres actually allows temporarily inconsistent data within a transaction – for example, via deferrable constraints that are only checked at commit. CRDB, on the other hand, doesn’t support that yet.
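
A minimal sketch of that retry strategy, assuming JDBC writes and relying on the Postgres-compatible SQLSTATE 23503 (foreign_key_violation) that CRDB also reports:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

final class ForeignKeyRetry {
  private static final String FK_VIOLATION = "23503"; // SQLSTATE for foreign_key_violation

  /** Retries an upsert until the parent row has been replicated by the other writer. */
  static void upsertWithRetry(Connection conn, String sql, Object[] params, int maxAttempts)
      throws SQLException, InterruptedException {
    long backoffMillis = 100;
    for (int attempt = 1; ; attempt++) {
      try (PreparedStatement stmt = conn.prepareStatement(sql)) {
        for (int i = 0; i < params.length; i++) {
          stmt.setObject(i + 1, params[i]);
        }
        stmt.executeUpdate();
        return;
      } catch (SQLException e) {
        // Only retry ordering problems between independent per-table writers;
        // anything else is a real error. Cyclic foreign keys never resolve this way.
        if (!FK_VIOLATION.equals(e.getSQLState()) || attempt == maxAttempts) {
          throw e;
        }
        Thread.sleep(backoffMillis);
        backoffMillis = Math.min(backoffMillis * 2, 10_000);
      }
    }
  }
}
```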

We considered combining writers for related tables, but older versions of Postgres didn’t give us the ordering guarantees we needed. In the end, we turned off constraints temporarily – a seemingly risky move, as constraints provide data integrity at the cost of performance, but safe in our case because the migration process ran in isolation: no reads, no writes from the outside, and they were put back in place just before the final cutover.
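
Turning a foreign key off and back on boils down to a pair of DDL statements, run before the bulk load and again ahead of cutover. A hedged sketch with hypothetical table and constraint names:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ToggleForeignKeys {
  public static void main(String[] args) throws Exception {
    boolean enable = args.length > 0 && args[0].equals("enable");
    try (Connection conn = DriverManager.getConnection(
            "jdbc:postgresql://crdb-lb.internal:26257/appdb?sslmode=require",
            "migration_user", "changeme");
         Statement stmt = conn.createStatement()) {
      if (enable) {
        // Restored just before cutover; adding the constraint validates existing rows.
        stmt.execute("ALTER TABLE orders ADD CONSTRAINT orders_account_fk "
            + "FOREIGN KEY (account_id) REFERENCES accounts (id)");
      } else {
        // Safe only because nothing else reads or writes these tables during the migration.
        stmt.execute("ALTER TABLE orders DROP CONSTRAINT orders_account_fk");
      }
    }
  }
}
```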

What’s Next

This wasn’t our first dance with Change Data Capture – we’ve used it for cache invalidation, analytics, and more – but this project reaffirmed its power. With the combination of stream processing, CDC, and proven solutions for challenges like performance, we’ve opened new doors for personalization, real-time analytics, and operational resilience for enterprise-scale data migrations.

CockroachDB delivered the scale and reliability our product and platform teams needed. But as software engineers in the Database space, the real satisfaction came from building a system that embraced the process, managed failures gracefully, and restored a sense of control.

And in the end, that’s what every good data story is really about.
