Romina Jafaryanyazdi is an Associate Backend Engineer based in our Zurich office. She explains how the Marketing Platform team started using Kafka to run reports. From the challenges of multiple API calls to the importance of scalability, here are her takeaways.
I joined GetYourGuide in 2021 and work on the Marketing Platform team. One of our primary missions is to deliver data and reports from different sources and partners to internal stakeholders. This means downloading marketing data from external partners like Google and providing processed data to our internal data pipeline. GetYourGuide’s data analysts, marketers, and data engineers use this data to make business decisions.
Given the amount of data we work with and the need for a scalable approach, we needed a tool to manage and keep track of data and messages. After extensive discussions and tests, we decided on Kafka as the ideal solution for our use case. To download these reports, we integrate the external API into our reporting pipeline.
The pipeline is the sequence of jobs that process the data. A job sends download requests to the external API server, receives the data back, and exports it in different formats for our stakeholders.
We need to consider some constraints when we are defining these jobs. One of the main ones is that the job should be ‘stateless.’ In this context, being stateless means we should be able to rerun the job without worrying about affecting or corrupting the final data.
Depending on how GetYourGuide is set up with the external partner, we may have several accounts from which we need to download data. To speed this up, we usually want to parallelize the downloads across accounts.
Sometimes we need to download a big chunk of data, which translates into many API calls.
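To make the idea of parallelizing across accounts concrete, here is a minimal sketch using only the Python standard library. The account IDs and the `download_report` function are hypothetical stand-ins and not our actual partner integration; a real version would call the external API instead of returning a dummy result.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical account IDs; the real ones depend on the partner setup.
ACCOUNTS = ["account-a", "account-b", "account-c"]


def download_report(account_id):
    """Stand-in for a call to the external partner API."""
    # A real implementation would send an HTTP request here.
    return {"account": account_id, "rows": len(account_id)}


def download_all(accounts, max_workers=4):
    """Download the reports of all accounts concurrently."""
    # Threads fit this workload because each download mostly waits on the network.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map() returns results in the same order as the input accounts.
        return list(pool.map(download_report, accounts))


reports = download_all(ACCOUNTS)
```

This speeds up the downloads, but it does nothing to remember which requests already succeeded if the job crashes halfway through.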
There are some challenges in this process:
In all these scenarios, we need to keep track of the requests we have sent so far to avoid duplicate requests and continue with the rest.
The main challenge is how we should remember the successful API calls, considering that the job should remain stateless.
The simplest way to keep a job stateless and safe to rerun is to delete all the data from the previous run and rewrite everything. However, depending on the job, this can waste a lot of time and resources, because parts of the job that already succeeded are done again. We could save resources by simply continuing from where the previous run ended. To do that, we need to keep track of the requests already made to the external API.
When we examined the challenge of remembering successful API calls while keeping the job stateless, we identified several potential strategies. However, each would have required additional steps, making it far from efficient. Initial ideas included saving indexes on files and using Dynamo to save successful requests.
What we needed was a scalable, parallelizable solution that adds minimal overhead to the job and keeps track of the requests already sent. The answer was Kafka.
Not only does Kafka address each of our requirements, but it also has the added advantage of observability in our pipeline.
We use Spark to push the content of all our requests to a Kafka topic. Each message contains exactly one request. A job implemented as a Kafka Streams app then consumes the messages and sends the requests.
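As an illustration of "one message per request", the sketch below shows how a single request could be serialized into a key and value for a Kafka record. It is a sketch under assumptions: the `ReportRequest` fields, the JSON encoding, and keying by account are all hypothetical, and the code does not talk to Kafka or Spark at all.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ReportRequest:
    # All field names are hypothetical; they only illustrate the idea.
    account_id: str
    report_type: str
    date: str  # ISO date of the day to download


def to_record(request):
    """Serialize one request into a (key, value) pair of bytes."""
    # Assumption: keying by account, so requests of one account share a key.
    key = request.account_id.encode("utf-8")
    value = json.dumps(asdict(request), sort_keys=True).encode("utf-8")
    return key, value


def from_value(value):
    """Rebuild the request on the consumer side."""
    return ReportRequest(**json.loads(value.decode("utf-8")))


request = ReportRequest("account-a", "campaign_performance", "2022-01-31")
key, value = to_record(request)
restored = from_value(value)
```

Because every message is self-contained, the consuming job needs nothing but the message itself to know which API call to make.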
Kafka keeps track of which messages have already been consumed, so the application does not need to worry about data deduplication or saving indexes.
This solution also lets us scale out further by running more instances of the application in the same consumer group. The Kafka broker takes care of distributing messages among the applications within the same consumer group.
Additionally, this new solution gives us more observability in our pipeline. We can keep track of the Kafka topic lag and raise an alert if the lag grows too large.
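The toy model below sketches why the job itself can stay stateless. It is an in-memory simplification with a single partition, not the Kafka client API: `ToyTopic`, `run_job`, and the `fail_at` parameter are made up for illustration. The position of the next message to process lives with the topic rather than in the job, so a rerun simply continues from there.

```python
class ToyTopic:
    """In-memory stand-in for a single-partition topic read by one consumer group."""

    def __init__(self, messages):
        self.messages = list(messages)
        self.committed = 0  # offset of the next message to process


def run_job(topic, send_request, fail_at=None):
    """Process messages starting at the committed offset; commit after each success."""
    while topic.committed < len(topic.messages):
        offset = topic.committed
        if offset == fail_at:
            raise RuntimeError(f"simulated crash at offset {offset}")
        send_request(topic.messages[offset])
        topic.committed = offset + 1  # only advance once the request went through


sent = []
topic = ToyTopic(["req-0", "req-1", "req-2", "req-3"])

try:
    run_job(topic, sent.append, fail_at=2)  # first run crashes before offset 2
except RuntimeError:
    pass  # nothing was stored inside the job itself

run_job(topic, sent.append)  # the rerun resumes at offset 2
```

In this model, a crash between sending a request and advancing the offset would repeat that single request on the next run, which is far cheaper than redoing the whole job.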
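To picture what distributing work within a consumer group means, here is a simplified round-robin assignment of partitions to application instances. Kafka's real assignment strategies are more involved; the function name and the instance names are assumptions made for this sketch.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# One instance handles every partition of the topic.
single = assign_partitions(range(6), ["app-1"])

# Starting two more instances in the same group splits the work three ways.
scaled = assign_partitions(range(6), ["app-1", "app-2", "app-3"])
```

Each partition is owned by exactly one instance in the group, so the number of partitions caps how far this kind of scaling can go.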
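Lag is the number of messages that have been written to a partition but not yet consumed. A minimal sketch of such an alert check, assuming the offsets have already been fetched from Kafka by some monitoring tool, could look like this. The function names and the threshold are hypothetical.

```python
def partition_lag(end_offsets, committed_offsets):
    """Messages still waiting per partition: latest offset minus committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


def should_alert(end_offsets, committed_offsets, max_total_lag):
    """Return True when the total lag over all partitions exceeds the threshold."""
    total = sum(partition_lag(end_offsets, committed_offsets).values())
    return total > max_total_lag


end_offsets = {0: 100, 1: 80}  # latest offset per partition
committed = {0: 90, 1: 80}     # what the consumer group has processed so far

lag = partition_lag(end_offsets, committed)
alert = should_alert(end_offsets, committed, max_total_lag=5)
```

Here partition 0 is ten messages behind and partition 1 is fully caught up, so a threshold of five messages triggers the alert.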
Furthermore, Kafka ensures that the job remains stateless. Consequently, if the job crashes for any reason, we can just rerun the job without worrying about any of the mentioned concerns.
For our use case, we explored different approaches, including saving indexes on files and using Dynamo to save successful requests. Kafka turned out to be the best option because it handles tracking and parallelism itself. Using Kafka helped us save time and infrastructure resources and run our reports more efficiently and smoothly.