Integrating Kafka into our Reporting Pipeline
Romina Jafaryanyazdi is an Associate Backend Engineer based in our Zurich office. She explains how the Marketing Platform team started using Kafka to run reports. From the challenges of multiple API calls to the importance of scalability, here are her takeaways.
I joined GetYourGuide in 2021, and work on the Marketing Platform team. One of our primary missions is to deliver data and reports from different sources and partners to internal stakeholders. This means downloading marketing data from external partners like Google and providing processed data for our internal data pipeline. This data is used by GetYourGuide’s data analysts, marketers, and data engineers to make business decisions.
Given the amount of data we work with and the need for a scalable approach, we needed a tool to manage and keep track of data and messages. After extensive discussions and tests, we decided on Kafka as the ideal solution for our use case. To download these reports, we integrate the external API into our reporting pipeline.
The Pipeline: From Server to Stakeholder
The pipeline is the sequence of jobs that process the data. A job sends download requests to the external API server, receives the data back, and exports it in different formats for our stakeholders.
We need to consider some constraints when we are defining these jobs. One of the main ones is that the job should be ‘stateless.’ In this context, being stateless means we should be able to rerun the job without worrying about affecting or corrupting the final data.
Depending on how GetYourGuide is set up with the external partner, we may have several accounts from which to download data. To speed up this process, we usually want to parallelize the downloads across accounts.
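As a minimal sketch of parallelizing downloads across accounts, the snippet below uses a thread pool from the Python standard library. The function `download_account_report` and the account names are hypothetical stand-ins; a real job would call the partner API there.

```python
from concurrent.futures import ThreadPoolExecutor


def download_account_report(account_id):
    """Hypothetical stand-in for downloading one account's report."""
    # A real job would call the external partner API here; this stub
    # only returns placeholder rows so the example is self-contained.
    return {
        "account_id": account_id,
        "rows": [f"{account_id}-row-{i}" for i in range(3)],
    }


def download_all(account_ids, max_workers=4):
    """Download the reports of all accounts concurrently.

    pool.map returns results in the same order as the input accounts.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(download_account_report, account_ids))


reports = download_all(["account-a", "account-b", "account-c"])
```

Threads suit this kind of work because the job mostly waits on network responses. The sketch does not handle rate limits or failed calls.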
The Challenge: Keeping Jobs Running Efficiently, and Stateless
Sometimes, we need to download a big chunk of data, translating to many API calls.
There are some challenges in this process:
- Depending on the external server, the external API could enforce a rate limit. As a result, it takes more time to go through all requests, and the longer running time could cause the job to time out.
- We could receive unexpected responses from the external API server, potentially crashing the job. In these cases, we need to re-run the job.
- We want to parallelize the process even more; however, doing so requires additional logic to handle multi-threading.
In all these scenarios, we need to keep track of the requests we have sent so far to avoid duplicate requests and continue with the rest.
The main challenge is how we should remember the successful API calls, considering that the job should remain stateless.
Finding a Solution
The simplest way to keep a job stateless and safe to rerun is to delete all the data from the previous run and rewrite everything. However, depending on the job, this could waste a lot of time and resources, as it means redoing parts of the job that had already succeeded. We could save resources by simply continuing from where the previous run ended. To do that, we need to keep track of the requests already made to the external API.
When we examined the challenge of remembering successful API calls while keeping the job stateless, we identified several potential strategies. However, each would require additional steps, making them far from efficient. Initial ideas included:
Saving the requests themselves:
- After each successful call, we can save the request. ‘Saving’ could have different implementations: on disk, in memory, or in another technology such as Dynamo. Each has its own cost and could be time-consuming.
- When rerunning the job, before sending any request, we need to look up whether we have already sent it. This extra step is called deduplication. A deduplication step increases the running time even more and makes the job more complex.
Saving only the last index:
- If the job uses an iterator to go through the data and make requests, we can save just the last index. However, since the number of accounts is variable, we would need to save an index for each account, which is not a scalable solution for our use case.
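To make the cost of deduplication concrete, here is a minimal sketch of saving successful requests and checking them before each call. It assumes a JSON file as the store; `SentRequestLog`, `run_job`, and the request keys are hypothetical names used only for illustration.

```python
import json
import tempfile
from pathlib import Path


class SentRequestLog:
    """File-backed record of requests that already succeeded (hypothetical)."""

    def __init__(self, path):
        self.path = Path(path)
        # Load the keys saved by a previous run, if any.
        if self.path.exists():
            self.sent = set(json.loads(self.path.read_text()))
        else:
            self.sent = set()

    def already_sent(self, request_key):
        return request_key in self.sent

    def mark_sent(self, request_key):
        # Every successful call triggers an extra write to the store.
        self.sent.add(request_key)
        self.path.write_text(json.dumps(sorted(self.sent)))


def run_job(request_keys, log, send):
    """Send only the requests not yet recorded; return the keys sent now."""
    sent_now = []
    for key in request_keys:
        if log.already_sent(key):  # the deduplication lookup
            continue
        send(key)
        log.mark_sent(key)
        sent_now.append(key)
    return sent_now


log_path = Path(tempfile.mkdtemp()) / "sent_requests.json"
keys = ["account-a/report-1", "account-a/report-2", "account-b/report-1"]


def flaky_send(key):
    # Simulates an unexpected response that crashes the job.
    if key == "account-b/report-1":
        raise RuntimeError("unexpected response")


try:
    run_job(keys, SentRequestLog(log_path), flaky_send)
except RuntimeError:
    pass  # the first run crashed after two successful calls

# The rerun skips the two recorded requests and sends only the third.
second_run = run_job(keys, SentRequestLog(log_path), lambda key: None)
```

The rerun is safe, but the job now owns a store, a lookup before every request, and a write after every request, which is the overhead described above.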
What we needed was a scalable, parallelizable solution that adds minimal overhead to the job and keeps track of the requests already sent. The answer was Kafka.
Not only does Kafka address each of our requirements, but it also has the added advantage of observability in our pipeline.
We use Spark to push the content of all our requests to a Kafka topic. Each message contains exactly one request. We then implement a job as a Kafka Streams app that consumes these messages and sends the requests.
Kafka will keep track of ingested data, so the application does not need to worry about data deduplication or saving indexes.
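The idea of letting the broker remember progress can be illustrated with a toy model. The class below is an in-memory stand-in for a single-partition topic with one committed offset per consumer group. It is not the Kafka client API, and `ToyTopic`, `consume`, and the group name are hypothetical.

```python
class ToyTopic:
    """In-memory stand-in for a single-partition topic (not a Kafka client)."""

    def __init__(self):
        self.messages = []   # append-only log of messages
        self.committed = {}  # consumer group -> next offset to read

    def produce(self, message):
        self.messages.append(message)

    def poll(self, group):
        """Return (offset, message) for the next unread message, or None."""
        offset = self.committed.get(group, 0)
        if offset >= len(self.messages):
            return None
        return offset, self.messages[offset]

    def commit(self, group, offset):
        """Record that the message at `offset` has been processed."""
        self.committed[group] = offset + 1


def consume(topic, group, handle):
    """Process messages until the topic is drained; return those handled."""
    processed = []
    while (record := topic.poll(group)) is not None:
        offset, message = record
        handle(message)              # e.g. send the request to the API
        topic.commit(group, offset)  # commit only after success
        processed.append(message)
    return processed


topic = ToyTopic()
for i in range(4):
    topic.produce(f"request-{i}")


def crashing_handler(message):
    # Simulates an unexpected API response on the third request.
    if message == "request-2":
        raise RuntimeError("unexpected response")


try:
    consume(topic, "report-downloader", crashing_handler)
except RuntimeError:
    pass  # the job crashed; the committed offset stays at 2

# Rerunning the job resumes from the first uncommitted message.
resumed = consume(topic, "report-downloader", lambda message: None)
```

The job itself holds no record of what it has sent. Because the offset is committed only after the handler succeeds, a message that fails is delivered again on the next run.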
This solution also lets us scale further by running more instances of the application in the same consumer group. Kafka distributes the topic's partitions among the instances in a consumer group, so each message is processed by only one of them.
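A rough sketch of how work is split inside a consumer group: each partition is owned by exactly one instance, so adding instances spreads the partitions more thinly. The round-robin function below is only an illustration; Kafka's actual assignment strategy is configurable and may differ, and the instance names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment: each partition goes to exactly one consumer."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(sorted(partitions)):
        consumer = consumers[index % len(consumers)]
        assignment[consumer].append(partition)
    return assignment


# Six partitions shared by two instances, then by three.
two_instances = assign_partitions(range(6), ["app-1", "app-2"])
three_instances = assign_partitions(range(6), ["app-1", "app-2", "app-3"])
```

With two instances each one owns three partitions; with three instances each owns two. Instances beyond the number of partitions would receive nothing, so the partition count sets the upper bound on parallelism.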
Additionally, this new solution gives us more observability in our pipeline. We can monitor the consumer lag on the Kafka topic and raise an alert if it grows too large.
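Lag is the gap between the latest offset written to each partition and the offset the consumer group has committed. The sketch below computes it from two plain dictionaries; in practice these numbers would come from Kafka tooling or metrics, and the offsets and threshold here are made-up values for illustration.

```python
def total_lag(end_offsets, committed_offsets):
    """Sum, over all partitions, of messages written but not yet consumed."""
    return sum(
        max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    )


def should_alert(end_offsets, committed_offsets, threshold):
    """True when the consumer group has fallen too far behind."""
    return total_lag(end_offsets, committed_offsets) > threshold


# Hypothetical offsets per partition.
end_offsets = {0: 120, 1: 80}
healthy = {0: 100, 1: 80}  # 20 messages behind in total
behind = {0: 20, 1: 30}    # 150 messages behind in total

healthy_alert = should_alert(end_offsets, healthy, threshold=50)
behind_alert = should_alert(end_offsets, behind, threshold=50)
```

Here `healthy_alert` is `False` and `behind_alert` is `True`. A lag that keeps growing suggests the consumers are stuck or too slow for the incoming volume.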
Furthermore, Kafka ensures that the job remains stateless. Consequently, if the job crashes for any reason, we can just rerun the job without worrying about any of the mentioned concerns.
For our use case, we explored different approaches, including saving indexes in files and using Dynamo to save successful requests. Kafka proved to be the best option, as it handles tracking and parallelism itself. Using Kafka helped us save time and infrastructure resources and run our reports more efficiently and smoothly.