BoxOffice Migration

Discover how GetYourGuide successfully migrated millions of data objects between microservices without downtime. Our engineering team shares their strategy for seamless online migrations, including dual writing, incremental updates, and extensive monitoring. Learn best practices for executing large-scale migrations while maintaining data consistency and minimizing impact on customer experience.

Ayush Kumar

Senior Software Engineer

Key takeaways:

Ayush Kumar, Davor Kapac, Jefimija Zivkovic, and Hinrik Sigurðsson from our engineering team share their experience and lessons learned from a major migration project: moving the BoxOffice service into the Inventory service. The migration involved transitioning millions of active data objects and refactoring thousands of lines of code while maintaining uninterrupted service for GetYourGuide’s customers. In this post, they explain the transparent, observable, and incremental approach that ensured a smooth transition.

Engineering teams face a common hurdle when building software: they eventually need to redesign the data models and refactor services they use to support clean abstractions and complex features. In production environments, this might mean migrating millions of active data objects and refactoring thousands of lines of code.

Millions of customers book on GetYourGuide every day, expecting an uninterrupted and immersive travel experience. Behind the scenes, a myriad of systems and services orchestrate the product experience, and our internal sales teams regularly manage the products using our inventory platform. These backend systems continuously evolve and are optimized to meet and exceed customer and product team expectations.

The backend for our travel product uses a highly distributed microservices architecture, so these migrations happen at different points of the service call graph. They can happen on an edge API system serving customer devices, between the edge and mid-tier services, or in the flows internal teams use to manage products. Another relevant factor is whether the migration affects APIs that are stateless and idempotent or APIs that are stateful.

One of the main challenges when undertaking system migrations is establishing confidence and seamlessly transitioning the traffic to the upgraded architecture without adversely impacting the customer experience. This blog series will examine the tools, techniques, and strategies we have utilized to achieve this goal.

Why are migrations hard?

Scale

GetYourGuide has hundreds of millions of ticket and order objects. Running a large migration that touches all those objects is a lot of work for our production database.

Imagine that it takes one second to migrate each ticket and order object; sequentially, it would take over three years to migrate one hundred million objects.
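
The arithmetic behind that estimate can be checked in a few lines. The one-second cost per object is the hypothetical from the paragraph above, not a measured figure, and `sequential_years` is an illustrative helper name.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds


def sequential_years(objects: int, seconds_per_object: float = 1.0) -> float:
    """Years needed to migrate `objects` strictly one after another."""
    return objects * seconds_per_object / SECONDS_PER_YEAR


years = sequential_years(100_000_000)
print(f"{years:.2f} years")  # 3.17 years
```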

Uptime

Travelers constantly book on GetYourGuide while our sales teams upload tickets. We perform all infrastructure upgrades online rather than relying on planned maintenance windows. Because we couldn’t simply pause the booking service during migrations, we had to execute the transition with all our services operating at 100%.

Accuracy

Our Order table is used in many different places in our codebase. If we tried to simultaneously change thousands of lines of code across the service, we would overlook some edge cases. We needed to be sure that every service could continue to rely on accurate data.

A pattern for online migrations

Moving millions of objects from one database table to another is difficult, but many companies need to do it.

People often use a typical four-step dual-writing pattern to do large online migrations like this. Here’s how it works:

Dual writing to the existing and new tables to keep them in sync.
Changing all read paths in our codebase to read from the new table.
Changing all write paths in our codebase to only write to the new table.
Removing the endpoints that access the old service.
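
The four steps can be sketched as a small router that decides, per phase, where reads and writes go. This is a generic in-memory illustration of the pattern, not our production code: `Phase`, `MigratingStore`, and the dict-backed stores are hypothetical names, and the synchronous dual write stands in for whatever mechanism actually keeps the two stores in sync.

```python
from enum import Enum


class Phase(Enum):
    DUAL_WRITE = 1  # step 1: write to both stores, read from the old one
    READ_NEW = 2    # step 2: write to both stores, read from the new one
    WRITE_NEW = 3   # step 3: write to the new store only, read from it
    # Step 4 has no phase: the old path is deleted from the code entirely.


class MigratingStore:
    """Routes reads and writes between an old and a new store by phase."""

    def __init__(self, old: dict, new: dict, phase: Phase):
        self.old = old
        self.new = new
        self.phase = phase

    def write(self, key, value):
        self.new[key] = value
        if self.phase in (Phase.DUAL_WRITE, Phase.READ_NEW):
            self.old[key] = value  # keep the old store in sync

    def read(self, key):
        source = self.old if self.phase is Phase.DUAL_WRITE else self.new
        return source.get(key)


old_store, new_store = {}, {}
store = MigratingStore(old_store, new_store, Phase.DUAL_WRITE)
store.write("order-1", "booked")   # lands in both stores
store.phase = Phase.WRITE_NEW
store.write("order-2", "booked")   # lands in the new store only
print("order-1" in old_store, "order-2" in old_store)  # True False
```

Advancing `phase` one step at a time is what makes each stage reversible: moving back a phase restores the previous read or write path without touching data.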

Our migration services - BoxOffice to Inventory

BoxOffice is our internal service that processes the order flow, including adding to a cart, booking a tour, canceling, and modifying orders. It also manages the pre-bought ticket flow in our inventory. Inventory is the service that manages all the availability and prices for our tours and enables suppliers to add this availability.

We wanted to move the BoxOffice service into Inventory so that we could migrate the old PHP codebase to a new Java service and keep the order flow alongside the availability flow.

Because this was a massive technical migration, we kept the fundamental data model the same and made only minor tweaks to the database design to improve scalability and performance.

As a reminder, our four migration phases were:

Dual writing to the existing and new tables to keep them in sync.
Changing all read paths in our codebase to read from the new table.
Changing all write paths in our codebase to only write to the new table.
Removing the endpoints that access the old service.

Let’s walk through what these four phases looked like for us in practice.

Part 1: Dual writing

We began the migration by creating a new database. The first step was to backfill the new database from the old one, and the second was to start duplicating new information so that it is written to both databases.

We implemented dual writing using two techniques.

Debezium streams - We used Debezium streams and Kafka consumers to replicate data from one database to the other. Whenever an entity is created, updated, or deleted in the old database, the change is pushed to a Kafka topic through MySQL Debezium streams. A Kafka consumer in the new service consumes this update and applies it to the new database, and vice versa.
Odd and even primary keys - New entities are created with odd keys in one service and even keys in the other, so the two services never generate the same key.
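
One way to picture the odd/even split is an allocator per service that only ever hands out keys of its own parity. The sketch below is an in-memory illustration, not the production setup: `ParityKeyAllocator` is a hypothetical name, and which service gets odd keys is an assumption made for the example. In a real deployment the database would usually do this itself, for instance through MySQL's `auto_increment_increment` and `auto_increment_offset` settings.

```python
import itertools


class ParityKeyAllocator:
    """Hands out only odd or only even integer keys."""

    def __init__(self, first_key: int):
        # first_key=1 yields 1, 3, 5, ...; first_key=2 yields 2, 4, 6, ...
        self._keys = itertools.count(start=first_key, step=2)

    def next_key(self) -> int:
        return next(self._keys)


# Assumption for illustration: the old service takes odd keys.
old_service_keys = ParityKeyAllocator(first_key=1)
new_service_keys = ParityKeyAllocator(first_key=2)

old_ids = [old_service_keys.next_key() for _ in range(3)]
new_ids = [new_service_keys.next_key() for _ in range(3)]
print(old_ids, new_ids)  # [1, 3, 5] [2, 4, 6]
```

Because the two key ranges never overlap, an entity created on either side can be replicated to the other without a primary-key collision.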

Because changes flow in both directions, an update applied from the stream would itself be emitted as a change and sent back to the database it came from. The critical point is that this loop of database updates needs a short circuit, so that only the required updates are applied.

To decide whether to apply an update, we compared the update_timestamp in the Debezium stream with the update_timestamp of the entity in the database.

Note: We ensured that update_timestamp in the database was stored with up to 6 decimal places (microsecond precision) to follow Debezium’s timestamp convention.
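
A minimal sketch of such a timestamp-based short circuit follows. It assumes the rule "apply a change only if it is strictly newer than the stored row", which is one plausible reading of the comparison described above. `Row` and `apply_change` are hypothetical names, a dict stands in for the database table, and deletes are left out for brevity.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Row:
    id: int
    status: str
    update_timestamp: datetime  # microsecond precision


def apply_change(table: dict, event: Row) -> bool:
    """Apply a replicated change only if it is newer than the stored row.

    Returns True if the table was modified. An event that is not newer is
    treated as the echo of a write this side already has, so it is dropped
    and the replication loop ends there.
    """
    current = table.get(event.id)
    if current is not None and event.update_timestamp <= current.update_timestamp:
        return False
    # Store the event's own timestamp (not "now"), so that the echo of this
    # write compares as equal on the other side and is dropped there too.
    table[event.id] = event
    return True


ts = datetime(2024, 1, 1, 12, 0, 0, 123456)
new_db = {}
event = Row(id=1, status="booked", update_timestamp=ts)
print(apply_change(new_db, event))  # True: first time this change is seen
print(apply_change(new_db, event))  # False: echo of the same change
```

The comparison only works if both sides keep the timestamp at the same precision, which is why the microsecond note above matters: a truncated timestamp could make a genuinely newer change look like an echo.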

Part 2: Changing all read paths

Now that the old and new databases were in sync, we could use the new service to read all our data. Before switching any reads, we ran data consistency checks to verify that the new service had accurate data.
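
As an illustration of what a consistency check can look like, the sketch below compares two snapshots keyed by primary key and reports every row that differs or exists on only one side. It is a simplified assumption of the idea, not the check we ran: `find_mismatches` is a hypothetical helper, and the dicts stand in for query results from the two databases.

```python
def find_mismatches(old_rows: dict, new_rows: dict) -> dict:
    """Return {key: (old_value, new_value)} for rows that differ.

    A row missing on one side is reported with None for that side.
    """
    mismatches = {}
    for key in old_rows.keys() | new_rows.keys():
        old_value = old_rows.get(key)
        new_value = new_rows.get(key)
        if old_value != new_value:
            mismatches[key] = (old_value, new_value)
    return mismatches


old_snapshot = {1: "booked", 2: "cancelled"}
new_snapshot = {1: "booked", 2: "booked", 3: "booked"}
for key, (old_value, new_value) in sorted(find_mismatches(old_snapshot, new_snapshot).items()):
    print(key, old_value, new_value)
```

On tables with millions of rows, a check like this would typically run in batches over key ranges rather than over full snapshots.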

Once the dual writing was in production, we started migrating all read endpoints using a feature toggle between the new and old services. We first performed rounds of manual testing for a few data points to verify that the new service was accurate and then gradually rolled out the traffic to the new service.

We used an endpoint-by-endpoint approach, starting with less critical flows and moving to more critical ones, with monitoring in place at each step. We relied on the following to verify the new paths:

5xx and 4xx errors
Debug and error logs to investigate any errors
Clear communication with stakeholders about the migration

Using the feature toggle helped us stabilize the system faster in case of errors, because we could switch an endpoint back to the old service right away.
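
A per-endpoint read toggle with gradual rollout might look like the sketch below. It assumes a percentage-based toggle with stable hashing, which is a common way to build one but is not a description of our toggle system. `ReadToggle`, `rollout_bucket`, and the endpoint name are hypothetical.

```python
import hashlib


def rollout_bucket(key: str) -> int:
    """Map a key to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


class ReadToggle:
    """Decides, per endpoint, whether a read goes to the new service."""

    def __init__(self):
        self._percent_by_endpoint = {}  # endpoint -> % of reads on new service

    def set_percent(self, endpoint: str, percent: int) -> None:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._percent_by_endpoint[endpoint] = percent

    def use_new_service(self, endpoint: str, request_key: str) -> bool:
        # Endpoints that were never configured stay on the old service.
        percent = self._percent_by_endpoint.get(endpoint, 0)
        return rollout_bucket(f"{endpoint}:{request_key}") < percent


toggle = ReadToggle()
toggle.set_percent("get_order", 100)  # fully rolled out
print(toggle.use_new_service("get_order", "order-42"))  # True
toggle.set_percent("get_order", 0)    # rollback in one step
print(toggle.use_new_service("get_order", "order-42"))  # False
```

Hashing the request key keeps a given order on the same side as the percentage grows, which makes errors easier to attribute to one service or the other.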

Part 3: Changing all write paths

Next, we need to update the write paths to use our new service. Since we aim to roll out these changes incrementally, we’ll need to employ careful tactics.

Up until now, we’ve been writing data to the old database and then copying it to the new database.

We now want to reverse the order: write data to the new database and then sync it to the old database. Keeping these two stores consistent allows us to make incremental updates and observe each change.

Reimplementing all the code paths where we handle booking data was arguably the most challenging part of the migration, because the old service was years old, had no documentation, and contained many business edge cases. GetYourGuide’s logic for handling booking operations (e.g., reserve, book, cancel, modify) spans thousands of lines of code across multiple classes.

The key to a successful refactor will be our gradual rollout and extensive testing; we’ll isolate as many code paths into the smallest unit possible to apply each change carefully. Our two tables need to stay consistent with each other at every step.

For each code path, we decided to roll out at the tour and supplier level. We started by rolling out one endpoint for a few tours (prioritized by net revenue) and checked the following:

5xx and 4xx errors
Data in complementary tables

Gradually, we added more tours and suppliers for over a week before rolling out 100%.
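
The tour- and supplier-level rollout can be pictured as an allowlist per endpoint. The sketch below is an assumption about the shape of such a toggle rather than our actual configuration: `WriteRollout`, the endpoint it would belong to, and the IDs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WriteRollout:
    """Allowlist deciding which writes of one endpoint use the new service."""

    enabled_tours: set = field(default_factory=set)
    enabled_suppliers: set = field(default_factory=set)
    fully_rolled_out: bool = False

    def writes_to_new_service(self, tour_id: int, supplier_id: int) -> bool:
        return (
            self.fully_rolled_out
            or tour_id in self.enabled_tours
            or supplier_id in self.enabled_suppliers
        )


# One rollout object per endpoint, e.g. a hypothetical "reserve" endpoint.
reserve_rollout = WriteRollout(enabled_tours={101, 102})
print(reserve_rollout.writes_to_new_service(tour_id=101, supplier_id=7))  # True
print(reserve_rollout.writes_to_new_service(tour_id=999, supplier_id=7))  # False

reserve_rollout.enabled_suppliers.add(7)  # widen the rollout to a supplier
print(reserve_rollout.writes_to_new_service(tour_id=999, supplier_id=7))  # True
```

Scoping by tour and supplier keeps every write for a given tour on one path at a time, which makes it straightforward to compare the data in the two tables for exactly the tours that were switched.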

Part 4: Removing old endpoints

Our final (and most satisfying) step is to remove code that writes to the old database and eventually delete the old service.

Once we’ve determined that no more code relies on the BoxOffice service and database, we no longer need to write to the old table. With this change, our code no longer uses the old service, and the new service now becomes our source of truth.

We could now remove the BoxOffice service calls from all the other services and incrementally progress towards removing the code. We first removed all the code that used the old service in different clients, and then cleaned up the feature toggles in the new service and the Kafka consumers that synced data back to the old service.

Once we were certain that all usages of the old service had been removed, which we confirmed by checking metrics of calls to the old service, we retired it by pruning all infrastructure resources and deleting the repository.

Conclusion

Running migrations while keeping the new service APIs consistent is complicated. Here’s what helped us run this migration safely:

We laid out a gradual rollout migration strategy to allow us to transition data stores while operating our production services without downtime.
We implemented a database sync between the two services to ensure consistency.
All the changes we made were incremental. We never attempted to roll out more than one endpoint at a time.
All our changes were highly transparent and observable. Proper monitoring alerted us as soon as any single feature became inconsistent in production. At each step, we gained confidence that the migration was safe.

This approach has influenced the many online migrations we’ve executed at GetYourGuide. We hope these practices prove helpful for other teams performing migrations at scale.
