Ayush Kumar, Davor Kapac, Jefimija Zivkovic, and Hinrik Sigurðsson from our engineering team share their experience and lessons learned from a major migration project: moving the BoxOffice service into the Inventory service. The migration involved transitioning millions of active data objects and refactoring thousands of lines of code while maintaining uninterrupted service for GetYourGuide’s customers. In this post, they explain the transparent, observable, and incremental approach that ensured a smooth transition.
Engineering teams face a common hurdle when building software: they eventually need to redesign the data models and refactor services they use to support clean abstractions and complex features. In production environments, this might mean migrating millions of active data objects and refactoring thousands of lines of code.
Millions of customers book on GetYourGuide every day, expecting an uninterrupted and immersive traveling experience. Behind the scenes, a myriad of systems and services orchestrate the product experience. Our internal sales teams regularly manage the products using our inventory platform. These backend systems consistently evolve and optimize to meet and exceed customer and product team expectations.
The backend for our travel product uses a highly distributed microservices architecture, so these migrations happen at different points of the service call graph: on an edge API system serving customer devices, between the edge and mid-tier services, or on the paths our internal teams use to manage products. Another relevant factor is whether the migration affects APIs that are stateless and idempotent or APIs that are stateful.
One of the main challenges when undertaking system migrations is establishing confidence and seamlessly transitioning the traffic to the upgraded architecture without adversely impacting the customer experience. This blog series will examine the tools, techniques, and strategies we have utilized to achieve this goal.
GetYourGuide has hundreds of millions of ticket and order objects. Running a large migration that touches all those objects is a lot of work for our production database.
Imagine that it takes one second to migrate each ticket and order object; sequentially, it would take over three years to migrate one hundred million objects.
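The arithmetic behind that estimate can be checked with a few lines of Java. This is only an illustration of the calculation; the class and method names are hypothetical.

```java
import java.time.Duration;

final class MigrationEstimate {
    /** Total time to migrate `objects` one after another at `secondsPerObject` each. */
    static Duration sequential(long objects, long secondsPerObject) {
        return Duration.ofSeconds(Math.multiplyExact(objects, secondsPerObject));
    }

    /** Converts a duration to years, assuming 365-day years. */
    static double toYears(Duration d) {
        return d.toDays() / 365.0;
    }
}
```

`MigrationEstimate.sequential(100_000_000L, 1).toDays()` comes to 1157 days, which is a little over three years.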
Travelers constantly book on GetYourGuide while our sales teams upload tickets. We perform all infrastructure upgrades online rather than relying on planned maintenance windows. Because we couldn’t simply pause the booking service during migrations, we had to execute the transition with all our services operating at 100%.
Our Order table is used in many different places in our codebase. If we tried to simultaneously change thousands of lines of code across the service, we would overlook some edge cases. We needed to be sure that every service could continue to rely on accurate data.
Moving millions of objects from one database table to another is difficult, but many companies need to do it.
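A common way to run large online migrations like this is a four-step dual-writing pattern. Here’s how it works: write to both the old and the new store, move all reads to the new store, move all writes to the new store, and finally remove everything that relies on the old store.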
BoxOffice is our internal service that processes the order flow, including adding to a cart, booking a tour, canceling, and modifying orders. It also manages the pre-bought ticket flow in our inventory. Inventory is the service that manages all the availability and prices for our tours and enables suppliers to add this availability.
We wanted to move the BoxOffice service into Inventory in order to migrate the old PHP codebase to a new Java service and to have Inventory handle the order flow in addition to the availability flow.
Because this was a massive technical migration, we kept the fundamental data model the same and made only minor tweaks in the database design to enhance scalability and performance.
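As a reminder, our four migration phases were dual writing to the old and new databases, moving all read paths to the new service, moving all write paths to the new service, and removing the old service.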
Let’s walk through what these four phases looked like for us in practice.
We begin the migration by creating a new database. The first step is to backfill the new database from the old database, and the second step is to start duplicating new information so that it’s written to both databases.
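A backfill of this kind is usually done in batches ordered by primary key. The following is a minimal in-memory sketch of that idea, not the actual backfill job: the `Row` shape and the use of maps in place of database tables are assumptions for illustration, as is the choice to leave rows untouched if the new store already holds them. It uses records, so it needs Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

final class Backfill {
    /** Hypothetical row shape: primary key plus an opaque payload. */
    record Row(long id, String payload) {}

    /**
     * Copies rows from oldStore into newStore in batches, ordered by primary key.
     * Rows already present in newStore are left alone, so the backfill can be
     * re-run without overwriting what the new store already holds.
     * Returns the number of rows copied.
     */
    static int run(NavigableMap<Long, Row> oldStore, Map<Long, Row> newStore, int batchSize) {
        int copied = 0;
        Long cursor = null; // last primary key processed so far
        while (true) {
            List<Row> batch = nextBatch(oldStore, cursor, batchSize);
            if (batch.isEmpty()) {
                return copied;
            }
            for (Row row : batch) {
                if (newStore.putIfAbsent(row.id(), row) == null) {
                    copied++;
                }
                cursor = row.id();
            }
        }
    }

    /** Keyset pagination: the next batchSize rows with an id greater than cursor. */
    private static List<Row> nextBatch(NavigableMap<Long, Row> store, Long cursor, int batchSize) {
        NavigableMap<Long, Row> tail = (cursor == null) ? store : store.tailMap(cursor, false);
        List<Row> batch = new ArrayList<>();
        for (Row row : tail.values()) {
            if (batch.size() == batchSize) {
                break;
            }
            batch.add(row);
        }
        return batch;
    }
}
```

Paging by "id greater than the last one seen" keeps each batch cheap regardless of how far the backfill has progressed, which matters when the production database is serving live traffic at the same time.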
We implemented dual writing using two techniques.
The critical point here is that the loop of database updates between the two databases needs a short circuit, so that only the required updates are consumed and a change is not copied back and forth indefinitely.
To short-circuit this loop, we compared the update_timestamp in the Debezium stream with the update_timestamp in the database entity to decide which update to apply.
Note: We ensured that update_timestamp in the database was stored with up to 6 decimal places (microsecond precision) to follow Debezium’s timestamp convention.
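One way such a timestamp comparison could look in Java is sketched below. The exact rule is an assumption: the sketch applies an incoming change only when its update_timestamp is strictly newer than the stored one, and it assumes the change event carries the timestamp as microseconds since the Unix epoch. The class and method names are hypothetical.

```java
import java.time.Instant;

final class UpdateShortCircuit {
    /**
     * Converts microseconds since the Unix epoch to an Instant.
     * Assumption: the change event carries update_timestamp as epoch microseconds.
     */
    static Instant fromEpochMicros(long micros) {
        long seconds = Math.floorDiv(micros, 1_000_000L);
        long nanos = Math.floorMod(micros, 1_000_000L) * 1_000L;
        return Instant.ofEpochSecond(seconds, nanos);
    }

    /**
     * Apply an incoming change only if it is strictly newer than the stored row.
     * An event that merely echoes a write this side already has carries the same
     * update_timestamp, so it is dropped and the sync loop ends there.
     */
    static boolean shouldApply(Instant eventUpdatedAt, Instant storedUpdatedAt) {
        if (storedUpdatedAt == null) {
            return true; // the row does not exist yet on this side
        }
        return eventUpdatedAt.isAfter(storedUpdatedAt);
    }
}
```

Under this rule the precision note above matters: if the stored value were truncated to whole seconds while the event kept microseconds, an echoed event would look newer than the stored row and would be applied again.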
Data consistency checks were run initially to verify the new service has accurate data.
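A consistency check of this kind boils down to comparing the same rows on both sides and reporting the differences. The sketch below shows the idea on in-memory maps keyed by id; the names are hypothetical, and a real check would read from both databases in batches rather than hold everything in memory.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

final class ConsistencyCheck {
    /** Returns the ids whose rows are missing on one side or differ between the two stores. */
    static <V> Set<Long> mismatchedIds(Map<Long, V> oldRows, Map<Long, V> newRows) {
        Set<Long> allIds = new TreeSet<>(oldRows.keySet());
        allIds.addAll(newRows.keySet());

        Set<Long> mismatched = new TreeSet<>();
        for (Long id : allIds) {
            // A missing row shows up as null and therefore counts as a mismatch.
            if (!Objects.equals(oldRows.get(id), newRows.get(id))) {
                mismatched.add(id);
            }
        }
        return mismatched;
    }
}
```

An empty result means the two sides agree for the rows that were compared.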
Now that the old and new databases are in sync, we will use the new service to read all our data.
Once the dual writing was in production, we started migrating all read endpoints using a feature toggle between the new and old services. We first performed rounds of manual testing for a few data points to verify that the new service was accurate and then gradually rolled out the traffic to the new service.
We used an endpoint-by-endpoint approach, starting from less critical flows and moving to more critical flows using proper monitoring. The following metrics were used to verify new paths:
Using the feature toggle helped us stabilize the system faster in case of errors.
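A per-endpoint toggle with a gradual rollout can be sketched as follows. This is an illustration under assumptions, not the toggle implementation used in the migration: the percentage-based bucketing, the endpoint names, and the `ReadToggle` class are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class ReadToggle {
    // Percentage (0-100) of traffic per endpoint that is served by the new service.
    private final Map<String, Integer> rolloutPercent = new ConcurrentHashMap<>();

    void setRollout(String endpoint, int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        rolloutPercent.put(endpoint, percent);
    }

    /** Deterministic per key, so the same key is always served by the same side. */
    boolean useNewService(String endpoint, String key) {
        int bucket = Math.floorMod(key.hashCode(), 100); // 0..99
        return bucket < rolloutPercent.getOrDefault(endpoint, 0);
    }

    /** Routes a read to the new or the old service according to the toggle. */
    <T> T read(String endpoint, String key,
               Function<String, T> oldService, Function<String, T> newService) {
        return useNewService(endpoint, key) ? newService.apply(key) : oldService.apply(key);
    }
}
```

Setting an endpoint back to 0 sends all of its reads to the old service again without a deployment, which is what makes a toggle useful for stabilizing quickly after an error.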
Next, we need to update the write paths to use our new service. Since we aim to roll out these changes incrementally, we’ll need to employ careful tactics.
Up until now, we’ve been writing data to the old database and then copying it to the new database.
We now want to reverse the order: write data to the new database and then sync it to the old database. Keeping these two stores consistent allows us to make incremental updates and observe each change.
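The reversal can be pictured as a writer whose primary store is switchable. The sketch below is a simplification under stated assumptions: it syncs the second store with a direct call, whereas the sync described in this post runs through change events, and the `DualWriter` class and its `Primary` enum are hypothetical names.

```java
import java.util.function.BiConsumer;

final class DualWriter<K, V> {
    enum Primary { OLD, NEW }

    private final BiConsumer<K, V> oldStore;
    private final BiConsumer<K, V> newStore;
    private volatile Primary primary = Primary.OLD; // start with the old store as source of truth

    DualWriter(BiConsumer<K, V> oldStore, BiConsumer<K, V> newStore) {
        this.oldStore = oldStore;
        this.newStore = newStore;
    }

    void setPrimary(Primary primary) {
        this.primary = primary;
    }

    /** Writes to the primary store first, then syncs the same value to the other store. */
    void write(K key, V value) {
        boolean newFirst = (primary == Primary.NEW);
        BiConsumer<K, V> first = newFirst ? newStore : oldStore;
        BiConsumer<K, V> second = newFirst ? oldStore : newStore;
        first.accept(key, value);  // source of truth for this write
        second.accept(key, value); // keeps the other store consistent
    }
}
```

Because both stores receive every write in either mode, switching the primary back to the old store remains possible at any point during the rollout.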
Reimplementing all code paths where we handle booking data is arguably the most challenging part of the migration because the old service was years old, had no documentation, and covered multiple business edge cases. GetYourGuide’s logic for handling booking operations (e.g., reserve, book, cancel, modify) spans thousands of lines of code across multiple classes.
The key to a successful refactor will be our gradual rollout and extensive testing; we’ll isolate as many code paths into the smallest unit possible to apply each change carefully. Our two tables need to stay consistent with each other at every step.
For each code path, we decided to roll out at the tour and supplier level. We started by rolling out one endpoint for a few tours (prioritized by net revenue) and checked the following:
Over more than a week, we gradually added more tours and suppliers before rolling out to 100%.
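A rollout at the tour and supplier level can be modeled as an allowlist that is consulted before each write. The following is a minimal sketch with hypothetical names, assuming one `WriteRollout` instance per endpoint and numeric tour and supplier ids.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class WriteRollout {
    private final Set<Long> enabledTours = ConcurrentHashMap.newKeySet();
    private final Set<Long> enabledSuppliers = ConcurrentHashMap.newKeySet();
    private volatile boolean fullRollout = false;

    void enableTour(long tourId) {
        enabledTours.add(tourId);
    }

    void enableSupplier(long supplierId) {
        enabledSuppliers.add(supplierId);
    }

    void enableForEveryone() {
        fullRollout = true;
    }

    /** True if this booking operation should go through the new write path. */
    boolean useNewWritePath(long tourId, long supplierId) {
        return fullRollout
                || enabledTours.contains(tourId)
                || enabledSuppliers.contains(supplierId);
    }
}
```

Starting with a handful of tours keeps the blast radius small: if the new path misbehaves, only bookings for those tours are affected, and every other tour keeps using the old path.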
Our final (and most satisfying) step is to remove code that writes to the old database and eventually delete the old service.
Once we’ve determined that no more code relies on the BoxOffice service and database, we no longer need to write to the old table. With this change, our code no longer uses the old service, and the new service now becomes our source of truth.
We can now remove the BoxOffice service calls from all the other services and incrementally work towards removing the code. We first removed all the code using the old service in different clients and then cleaned up the feature toggles in the new service and the Kafka consumers that synced data back to the old service.
Once we were certain that all usages of the old service had been removed—by checking metrics of calls to the old service—we retired the old service by pruning all infrastructure resources and deleting the repository.
Running migrations while keeping the new service APIs consistent is complicated. Here’s what helped us run this migration safely:
This approach has influenced the many online migrations we’ve executed at GetYourGuide. We hope these practices prove helpful for other teams performing migrations at scale.