UPDATE: On July 8 at 3 pm BST/4 pm CEST, our data platform experts and Snowflake are teaming up for a webinar. Learn how modern data architecture can transform ordinary trips into extraordinary ones with valuable business insights. Speakers include Augusto Elesbão, senior data engineer at GetYourGuide; Robert Bemmann, data engineer; and Torsten Grabs, director of product management at Snowflake. Learn more and register.
Thiago Rigo, senior data engineer, and David Mariassy, data engineer, built a modern ETL pipeline from scratch using Debezium, Kafka, Spark and Airflow.
In the first part of this two-part series, Thiago walks us through our new and legacy ETL pipelines and the overall architecture, and gives us an overview of our extraction layer. This article is an adaptation of a presentation given at the Spark + AI Summit, organized by Databricks.
The GetYourGuide stack previously consisted of two main parts: Pentaho and Postgres. Both tools served us well for the past six years, but as data volume grew and business requirements changed, we decided it was time to move to a more modern architecture, since our ETL pipeline was starting to show its limitations.
At a high level, the pipeline consists of two main steps, executed as part of an Airflow DAG: extraction, which copies tables from the source databases into the data lake, and transformation, which combines those copies into the final tables.
Here’s an example: during the extraction phase, a table called booking is loaded from the source database, resulting in a copy of this table in our data lake. A transformation called fact_booking then uses that table, along with many others, to create a complete picture of what happens to bookings on our platform.
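The extract-then-transform dependency described above can be sketched as an Airflow DAG. Everything in this snippet is illustrative — the DAG ID, task IDs, schedule, and callables are assumptions, not GetYourGuide's actual code, and the real pipeline extracts many tables rather than one.

```python
# Illustrative Airflow DAG: extract the booking table, then build fact_booking.
# All names here are hypothetical; the real pipeline has many more tasks.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_table(table_name: str) -> None:
    """Placeholder: copy `table_name` from the source database to the data lake."""


def build_fact_booking() -> None:
    """Placeholder: join db_mirror tables into the fact_booking table."""


with DAG(
    dag_id="etl_pipeline",
    start_date=datetime(2020, 7, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_booking = PythonOperator(
        task_id="extract_booking",
        python_callable=extract_table,
        op_args=["booking"],
    )
    fact_booking = PythonOperator(
        task_id="fact_booking",
        python_callable=build_fact_booking,
    )
    # The transformation runs only after extraction has landed the mirror copy.
    extract_booking >> fact_booking
```

Expressing the dependency with `>>` is what lets Airflow retry or backfill the extraction without rerunning downstream transformations unnecessarily.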
Now that you have a basic understanding of the overall architecture, let’s dig a bit deeper into the extraction layer. Here’s what it looks like:
Let’s first go through what each component does individually, and then tie everything together at the end.
Everything starts with the CDC pipeline. Currently, we have two ways of ingesting data from source databases: Debezium or JDBC. Although we’re moving away from JDBC in favor of Debezium, we’ll cover both scenarios in this article.
What’s important to note is that after the CDC pipeline runs, we have several Parquet files representing the change logs from the source databases. These files may contain several records for the same primary key (PK), which is why the upsert step runs afterwards to compact them.
The JDBC Loader task is triggered via Airflow for a specific table. It connects to the source database, reads the data that has been updated (or all data, for small tables), and then writes out Parquet files.
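The incremental-read logic of such a loader can be sketched in a few lines. This is a stand-in, not the production code: an in-memory SQLite database replaces the JDBC source, plain dicts replace the Parquet output, and the table name and `updated_at` watermark column are assumptions.

```python
# Sketch of the JDBC Loader's incremental read. sqlite3 stands in for the real
# JDBC connection, and Python dicts stand in for the Parquet files written out.
import sqlite3


def load_updated_rows(conn, table, watermark_col, last_run):
    """Read only the rows updated since the last extraction run."""
    cur = conn.execute(
        f"SELECT id, status, {watermark_col} FROM {table} WHERE {watermark_col} > ?",
        (last_run,),
    )
    cols = [d[0] for d in cur.description]
    # In production, these rows would be written out as Parquet files.
    return [dict(zip(cols, row)) for row in cur.fetchall()]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE booking (id INTEGER, status TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO booking VALUES (?, ?, ?)",
    [
        (1, "confirmed", "2020-07-01"),
        (2, "pending", "2020-07-03"),
        (3, "cancelled", "2020-07-04"),
    ],
)

# Only rows changed after the previous run's watermark are extracted.
changed = load_updated_rows(conn, "booking", "updated_at", "2020-07-02")
```

For small tables, the same loader would simply skip the watermark filter and read everything, which is cheaper than tracking incremental state.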
As we mentioned before, Debezium constantly reads the databases’ event logs and publishes the changes to Kafka. At the other end of these topics, a Kafka Connect sink writes out Avro files to S3. The Avro Converter then picks up these files and converts them to Parquet.
Note: if there’s an inconsistent type change on any field, the Avro Converter or JDBC Loader will communicate with the Schema Service and drop that specific field, so that we don’t end up with schema problems on the final table.
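The field-dropping behavior can be illustrated with plain Python. The schema representation and function name below are assumptions for illustration — they are not the actual Schema Service API.

```python
# Sketch: drop fields whose type changed inconsistently between the schema
# registered with the Schema Service and the incoming change-log schema.
def drop_inconsistent_fields(registered: dict, incoming: dict, record: dict) -> dict:
    """Return the record without fields whose type conflicts with the registry."""
    conflicting = {
        name
        for name, typ in incoming.items()
        if name in registered and registered[name] != typ
    }
    return {k: v for k, v in record.items() if k not in conflicting}


registered_schema = {"id": "long", "price": "double", "status": "string"}
incoming_schema = {"id": "long", "price": "string", "status": "string"}  # price changed!

record = {"id": 42, "price": "19.99", "status": "confirmed"}
cleaned = drop_inconsistent_fields(registered_schema, incoming_schema, record)
# The conflicting `price` field is dropped, keeping the final table's schema stable.
```

Dropping a single field is a deliberate trade-off: losing one column temporarily is cheaper than failing the whole load or corrupting the target table's schema.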
After the CDC pipeline completes, Upsert is the next component to run. The goal of this step is to generate, for each source table, a corresponding table in the data lake under the db_mirror schema. So for the source table booking, you’ll have a table called db_mirror.booking.
For big tables, the upsert component also tries to optimize the FULL OUTER JOIN operation by selecting only the partitions that have been affected by the incoming change logs.
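In spirit, the upsert does two things: compact the change log down to the latest record per PK, then merge the result into the existing mirror table — conceptually the FULL OUTER JOIN mentioned above. The sketch below uses plain dicts in place of Spark DataFrames; the field names and structures are illustrative, not the production code.

```python
# Sketch of the upsert step: compact change logs to one record per PK, then
# merge into the mirror table. Plain dicts stand in for Spark DataFrames.
def compact(changelog):
    """Keep only the latest record for each primary key."""
    latest = {}
    for rec in sorted(changelog, key=lambda r: r["ts"]):
        latest[rec["id"]] = rec  # later timestamps overwrite earlier ones
    return latest


def upsert(mirror, changelog):
    """Merge compacted change logs into the mirror (FULL OUTER JOIN semantics):
    unchanged rows are kept, changed rows are replaced, new rows are added."""
    merged = dict(mirror)
    merged.update(compact(changelog))
    return merged


mirror = {1: {"id": 1, "status": "pending", "ts": 1}}
changelog = [
    {"id": 1, "status": "confirmed", "ts": 2},  # update to an existing row
    {"id": 2, "status": "pending", "ts": 3},    # brand-new row
    {"id": 2, "status": "confirmed", "ts": 4},  # same PK updated again
]

db_mirror_booking = upsert(mirror, changelog)
```

The partition-pruning optimization for big tables corresponds to running this merge only over the partitions of `mirror` that actually appear in the change log, instead of rewriting the whole table.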
At this point, we have finished extracting data from the source databases using either Debezium or the JDBC Loader, and we’re ready to dive into part two: the transformations.
Are you interested in joining our engineering team? Check out our open positions.