Dharin Shah, Senior Software Engineer, shares how we ramped up search functionality by transitioning from a Kafka Streams-based system to a PostgreSQL + Kafka architecture—enhancing scalability, efficiency, and cost-effectiveness.
{{divider}}
In the ever-evolving tech world, adaptability and innovation are paramount. At GetYourGuide, we embarked on a transformative journey, transitioning from a Kafka Streams-based system to PostgreSQL + Kafka for our data processing needs. The biggest reason for taking this journey was a scalability problem with our previous solution: as the company was growing—and still is!—we weren't confident the old system would scale with it.
Before diving into the core technical issues, let's first understand why this data processing is essential. The search function in GetYourGuide reflects the global state, supporting more than 22 languages worldwide—with ongoing expansion to additional languages. It offers an API with versatile options for filtering and aggregating data, and it generates the initial content you see on our website. However, each component of this content is sourced from services external to our search system.
This core requirement led the Search team to build a data-processing pipeline that combines various data sources into a view of the state of the GetYourGuide world. It's important to highlight that by keeping our APIs read-only for clients and adopting a denormalized document data model in OpenSearch, we optimized for read operations: relationships are established asynchronously in our processing pipeline, improving read efficiency by reducing the need for complex joins or queries. As you can see in the screenshot above—and you can also visit getyourguide.com to explore further—the first page shows data pulled from the search APIs, and almost all pages on GetYourGuide serve data via the Search APIs. Now, let's dive into the technical challenge.
To build the data served via these APIs, we must process it from various sources, combining and mapping it into the internal search world. To achieve that, we built our initial system on Kafka Streams. It was designed to process activity data, which was then denormalized and saved in OpenSearch to be served via the API. This data was a complex amalgamation of elements like locations, categories, rankings, reviews, and more. To combine it all, we relied heavily on Kafka Streams KTable joins.
As you can see in the image above, even visualization does little to simplify the complexity.
As our data grew and our requirements became more intricate, we encountered several challenges:
Every time we added or changed code, we had to reprocess all the data from the beginning. This generated more updates and put extra strain on the Kafka brokers and on OpenSearch—our search API database. It is possible to add more data types and make the system more adaptable using the Kafka Streams Processor API, which offers finer control over data flow; however, that makes the code more complex and still doesn't solve the scaling issue.
Because we didn't have Persistent Volumes in our Kubernetes cluster, we essentially had to send all our processed data back to Kafka to recover state after a restart.
We realized that a paradigm shift was essential to address these challenges, so we decided to transition our data processing from Kafka Streams to PostgreSQL. This decision was influenced by several factors.
The transition was not without its challenges. We had to redesign our data processing flow, keeping in mind our previous system’s limitations and PostgreSQL’s capabilities.
Our new data processing flow was designed to be efficient, simple, scalable, and easy to manage:
One of the significant positive changes was the introduction of a custom PostgreSQL function to merge JSON data. This simplified our merging strategy:
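The exact function isn't reproduced here, but a minimal sketch of what such a merge helper could look like is shown below; the name jsonb_merge and the shallow, last-writer-wins merge semantics are assumptions rather than the production implementation.

```sql
-- Hypothetical sketch of a JSONB merge helper. The name and the shallow
-- merge semantics are assumptions, not the exact production function.
CREATE OR REPLACE FUNCTION jsonb_merge(left_doc jsonb, right_doc jsonb)
RETURNS jsonb
LANGUAGE sql
IMMUTABLE
AS $$
    -- COALESCE guards against NULL inputs; '||' performs a shallow,
    -- top-level merge in which keys from right_doc override left_doc.
    SELECT COALESCE(left_doc, '{}'::jsonb) || COALESCE(right_doc, '{}'::jsonb);
$$;
```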
This simple function allowed us to merge data from various sources efficiently, without the need for complex Kafka Streams operations. It also meant the database could merge data without any business logic being baked into the merge itself.
Each source's JSON is saved as an individual piece, and the merge step combines those pieces using the function above, based on conditions encoded in the SQL.
Simplified representation of complex SQL for Data Aggregation:
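The full query isn't shown here, but a sketch along the following lines conveys the idea; the table and column names (activities, location_fragments, doc, and so on) are illustrative, not our real schema.

```sql
-- Hypothetical sketch: fold the per-source JSON fragments for each activity
-- into one denormalized document using the merge helper sketched above.
-- All table and column names are illustrative.
SELECT
    a.activity_id,
    jsonb_merge(
        jsonb_merge(l.doc, c.doc),
        jsonb_merge(r.doc, v.doc)
    ) AS search_document
FROM activities a
LEFT JOIN location_fragments l ON l.activity_id = a.activity_id
LEFT JOIN category_fragments c ON c.activity_id = a.activity_id
LEFT JOIN ranking_fragments  r ON r.activity_id = a.activity_id
LEFT JOIN review_fragments   v ON v.activity_id = a.activity_id;
```

Because the merge helper tolerates NULL inputs, activities that are missing a fragment still produce a valid document.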
Another important piece was the introduction of buffering logic in our new architecture. This buffering logic was designed to handle notification updates from Kafka efficiently—check out the code here.
It allowed us to process and merge data efficiently, keeping the system fast and reliable.
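The buffering itself lives in the service layer (see the linked code), but to illustrate the write path a flushed batch might feed into, here is a sketch of upserting buffered notification updates into PostgreSQL in one statement. The activity_fragments table, its columns, the sample values, and the unique constraint on (activity_id, source) are assumptions for illustration only.

```sql
-- Hypothetical sketch of flushing a buffered batch of Kafka notifications.
-- Assumes a UNIQUE constraint on (activity_id, source); all names and
-- sample values are illustrative.
INSERT INTO activity_fragments (activity_id, source, doc, updated_at)
VALUES
    (1001, 'reviews',  '{"review_count": 42, "rating": 4.8}', now()),
    (1002, 'rankings', '{"rank": 7}',                         now())
ON CONFLICT (activity_id, source)
DO UPDATE SET
    -- Merge the incoming fragment into the stored one instead of overwriting it.
    doc        = jsonb_merge(activity_fragments.doc, EXCLUDED.doc),
    updated_at = EXCLUDED.updated_at;
```

Merging on conflict rather than overwriting keeps each source's fragment additive, which is what lets the database do the combining work described above.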
The impact of our transition was profound:
The chart above shows how much storage we shaved off even as our usage grew; the metric declines as we migrated and cleaned up gradually.
Furthermore, we have extended this architecture to other entities in our search system, making our entire data-processing ecosystem more robust and efficient. As a final note before wrapping up, we tied most of our operations to database compute—i.e., letting the database do all the work—so services only handle the final result of all the intermediate operations. That saved a considerable amount of network I/O and improved performance while reducing cost.
Our move from Kafka Streams to stateless Kafka and PostgreSQL was filled with challenges, but it was also a journey of learning, innovation, and growth. Today, our data processing system is more efficient, scalable, and cost-effective, and we are better positioned to handle our future data processing needs and continue to deliver value.
Thanks to Viktoriia Kucherenko for supporting the project throughout and for the initiative that led to prioritizing it.