Theodore Meynard, Data Science Manager at GetYourGuide in Berlin, discusses transitioning the activity ranking team from batch to real-time ranking, highlighting challenges, strategies, and innovations.
In this article, we outline our transition from a traditional batch-processing system to real-time machine learning, covering the challenges, strategies, and innovations that marked the journey. Read on to discover how we enhanced our activity ranking system, ensuring our customers are always matched with the most unforgettable travel experiences.
Previously, our system relied on daily batch scoring of every activity in our inventory, with our search service then ranking the activities based on these scores. While this approach was simple to maintain, it had significant drawbacks. It was challenging to personalize the ranking, and we depended on other teams whenever we wanted to test a new scoring approach or new logic. Every change ended up taking multiple weeks or even months to implement.
To address these issues, we envisioned ML models that could perform real-time inference using dynamic signals, hosted in a service owned by our team. To mitigate risks, we didn't jump straight into developing a service containing an ML model. Instead, we broke down the migration into multiple incremental steps, each adding value and allowing us to learn along the way.
We migrated our ranking logic to its own service. This service, which initially just looked up each activity's score and ordered activities by it, proved the feasibility of such a service and let us launch new experiments faster: in days instead of weeks.
We began to segment our ranking depending on user signals, such as the type of platform or the language used on our website. This step significantly improved our ranking (proven in an A/B test), reinforcing our hypothesis that ranking needed to be contextualized and showing us which segments were meaningful to incorporate.
We trained a model to rank activities in real time. We reused the segment scores we had created in the previous step to find the optimal segment combination based on past data. This step stabilized our model in production, helping us learn how to operate an ML model in real time and generate data to train it.
We improved the model by adding relevant user and activity features for ranking. This, combined with fine-tuning the model, led to a significant improvement in our ranking.
Operating such a service brought its own set of challenges. As our website and apps depend heavily on the ranking on most of our pages, high availability is crucial. We also had to maintain low latency and high development velocity while ensuring explicit ownership, a full workflow in continuous integration (CI), and end-to-end (E2E) tests with real data.
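The first two steps above (score lookup, then segment-specific ranking) can be sketched roughly as follows. This is a minimal illustration, not our production code: the score tables, the `rank_activities` function, and the choice of platform and language as the segment key are all hypothetical names chosen for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical batch-computed scores, one per activity (step 1).
GLOBAL_SCORES: Dict[str, float] = {"a1": 0.9, "a2": 0.5, "a3": 0.7}

# Hypothetical per-segment scores keyed by (platform, language) (step 2).
SEGMENT_SCORES: Dict[Tuple[str, str], Dict[str, float]] = {
    ("mobile", "de"): {"a1": 0.2, "a2": 0.8},
}


def rank_activities(
    activity_ids: List[str],
    platform: Optional[str] = None,
    language: Optional[str] = None,
) -> List[str]:
    """Order activities by segment score, falling back to the global score."""
    segment = SEGMENT_SCORES.get((platform, language), {})

    def score(activity_id: str) -> float:
        # Prefer the segment-specific score; unknown activities score 0.0.
        return segment.get(activity_id, GLOBAL_SCORES.get(activity_id, 0.0))

    return sorted(activity_ids, key=score, reverse=True)


if __name__ == "__main__":
    print(rank_activities(["a1", "a2", "a3"]))
    print(rank_activities(["a1", "a2", "a3"], platform="mobile", language="de"))
```

Falling back to the global score keeps the service usable for segments with little or no data, which is one plausible way to introduce segmentation incrementally.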
To overcome these challenges, we made key design decisions:
We used MLflow, an open-source library to manage ML model lifecycles, to clearly define the interface between our Data Scientists and MLOps engineers.
Data Scientist's Role:
MLOps Engineer's Role:
Validating our ML pipeline was a challenge we grappled with for a while. The intricate interplay between code and data in ML pipelines makes it difficult to separate and test them in isolation. Initially, we considered using dummy data for validation, but this approach had several drawbacks:
Given these challenges, we concluded that the most reliable way to test our pipeline was by using production data. However, we needed to sample this data to expedite the testing process.
To make this possible, we developed and open-sourced DDataflow, a tool designed to simplify data sampling for end-to-end tests within the continuous integration (CI) framework. If you're curious, explore DDataflow on PyPI.
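To illustrate the general idea of sampling production data for CI, here is a minimal sketch of deterministic, hash-based sampling. It does not use or reproduce the DDataflow API; the function names, the `key_field` parameter, and the row format are assumptions made for this example.

```python
import hashlib
from typing import Any, Dict, Iterable, List


def keep_in_sample(key: str, sample_rate: float) -> bool:
    """Map a key to a stable bucket in [0, 1) and keep it if below the rate."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < sample_rate


def sample_rows(
    rows: Iterable[Dict[str, Any]], key_field: str, sample_rate: float
) -> List[Dict[str, Any]]:
    """Return the rows whose key falls inside the sample."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    return [
        row for row in rows if keep_in_sample(str(row[key_field]), sample_rate)
    ]


if __name__ == "__main__":
    # Hypothetical rows standing in for production data.
    data = [{"user_id": i, "booked": i % 5 == 0} for i in range(1000)]
    sampled = sample_rows(data, key_field="user_id", sample_rate=0.1)
    print(f"kept {len(sampled)} of {len(data)} rows")
```

Hashing a key instead of drawing random numbers keeps the sample identical across CI runs, and sampling on the same key in several tables keeps joins between them consistent.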
To further ensure the reliability of our data and models, we introduced health checks or expectations.
These checks serve as a validation mechanism:
Dataset Expectations: For instance, we expect the dataset to maintain the share of booked activity within a certain range.
Model Expectations: Our models should consistently achieve an accuracy above a specified threshold.
These health checks are run as part of our end-to-end testing, combined with the data sampling provided by DDataflow. Given the increased noise from sampling, we've set more lenient requirements for these checks.
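The two kinds of expectations described above could look roughly like this. It is a sketch under stated assumptions: the `CheckResult` type, the function names, and every threshold value are hypothetical, and accuracy is computed by hand rather than with a specific library.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def check_booked_share(
    booked_flags: Sequence[bool], low: float, high: float
) -> CheckResult:
    """Dataset expectation: share of booked activities stays within [low, high]."""
    if not booked_flags:
        return CheckResult("booked_share", False, "dataset is empty")
    share = sum(booked_flags) / len(booked_flags)
    detail = f"share={share:.3f}, expected [{low}, {high}]"
    return CheckResult("booked_share", low <= share <= high, detail)


def check_accuracy(
    y_true: Sequence[int], y_pred: Sequence[int], threshold: float
) -> CheckResult:
    """Model expectation: accuracy is at or above the threshold."""
    if not y_true or len(y_true) != len(y_pred):
        return CheckResult("accuracy", False, "empty or mismatched inputs")
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    detail = f"accuracy={accuracy:.3f}, expected >= {threshold}"
    return CheckResult("accuracy", accuracy >= threshold, detail)


def run_health_checks(results: Iterable[CheckResult]) -> None:
    """Fail the pipeline run if any expectation is not met."""
    failures = [r for r in results if not r.passed]
    if failures:
        raise AssertionError("; ".join(f"{r.name}: {r.detail}" for r in failures))


if __name__ == "__main__":
    # Hypothetical values; sampled CI runs would use wider bounds than production.
    run_health_checks([
        check_booked_share([True, False, False, False], low=0.1, high=0.4),
        check_accuracy([1, 0, 1, 1], [1, 0, 0, 1], threshold=0.7),
    ])
    print("all health checks passed")
```

Passing the bounds in as parameters makes it straightforward to use looser values on sampled CI data and stricter ones on full production runs.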
We automated every step of our testing and deployment process in our CI. This automation removed room for manual error and increased our development speed.
Here's a step-by-step breakdown of our automated workflow:
Our automation extends to daily operations with Airflow, ensuring that our machine learning models are always up to date and reliable:
By automating every step, we've significantly reduced the potential for manual errors and streamlined our development process. This not only accelerates our development speed but also ensures consistent quality and reliability across all our ML operations.
Our journey was full of surprises and spanned multiple quarters. It relied on the collective effort of the dedicated members of the activity ranking team and of everyone who helped build our machine learning platform.
The key to our success was the incremental upgrade of our data product, explicit ownership, the use of production data for testing, and the automation of our workflows. As we look ahead, we're excited about the improvements we've made in our ranking system. We're currently working on personalization and integrating additional relevant signals to further enhance our ranking capabilities.
If you have any questions or are interested in joining our dynamic team, check out our open roles or learn about the growth path for engineers at GetYourGuide. We're hiring!