Robert Bemmann, and Shaurya Sood, Data Engineers at GetYourGuide in Berlin, share the migration approach their team undertook to transition to Databricks' E2 platform. They detail the carefully planned steps taken to achieve a smooth and accurate migration process, guaranteeing minimal disruption to both the Data Platform's functionality and the continuity of day-to-day operations.
At GetYourGuide, the Core Data Platform team’s mission is crucial: enabling the company to leverage data autonomously to build innovative data products. Our entire spectrum of data operations - ranging from processing, exploration, and transformation, to the creation of machine learning and AI systems - is meticulously executed on Databricks.
Until Q1 2023, our Databricks infrastructure operated on a legacy workspace architecture, known as the single tenant (ST) workspace, established in May 2015. However, with the evolution of technology, this architecture is making way for the advanced next-generation Databricks platform - E2, as it approaches its end of life by the conclusion of 2023. E2 brings enhanced functionalities, including Unity Catalog, Delta Live Tables, Databricks SQL, and multi-workspace support, alongside the capability to operate LLMs.
In light of this shift, a substantial migration was necessary, involving the migration of over 2200 data pipelines, predominantly spark jobs, accommodating 10+ teams and in excess of 550 users.
This migration wasn’t limited to data pipelines alone; a spectrum of essential assets like notebooks, clusters, data, and machine learning models also made the journey.
A priority throughout this transition was to ensure continuity, aiming for no or minimal interruption in operations. This blog post will discuss the details of the migration process. We'll cover what was needed for a smooth transition, our planning for the new Databricks workspace, and the challenges we faced and how we solved them.
Before starting the planning process, we first collected the necessary requirements. Conversations were held with Databricks users from different areas of the platform to understand their crucial needs. Databricks at GetYourGuide is utilized by a wide range of teams, not limited to engineering. These include analytics, marketing, business, and product teams.
Our main goal was to have a smooth migration with little to no disruption in data operations. In the Data Platform team at GetYourGuide, all data transformation and processing are done using Spark on Databricks. We have strict SLAs to deliver business reports based on these processes, and it was important to meet these SLAs without any interruptions. Other important teams, like Data Products and Marketing Platform, also depend a lot on Databricks. So, it was necessary to make sure their operations continued without any problems during the migration.
A flexible migration strategy was crucial, allowing individual teams to migrate at their own pace. This meant facilitating a migration process where one team could transition immediately, while another could take several weeks to move their workloads and jobs comprehensively.
Every component of Databricks, such as notebooks, jobs, clusters, MLFlows, teams, users, secrets, and instance profiles, needs to be transferred and function as usual. The only noticeable modification for end-users will be the updated URL for the new workspace, ensuring the platform maintains its familiar operation and reliability.
In terms of desired workspace design, we had to come up with a design that serves our potential future needs and get the buy-in from all the teams and their respective point of contact.
The new E2 architecture inherently supports multi-workspace functionality. Hence, initially, for the workspace design we wanted to come up with a design that allows a future separation in terms of workflows, costs and ACLs. Our initial design was to have one workspace per team (LOB=line of business) as we planned for a clear division of users by groups to improve the overall governance of the Lakehouse. Each team would have its own catalog (with separate write access), but with read access to other catalogs. Fine-grained table access rights (such as blocking read access or masking for PII data) would be managed by Unity Catalog governance rules. The following visual outlines what we had in mind.
However, the above design was too complicated to migrate into immediately. Therefore, we decided to go with one single E2 workspace that would serve as a bridge workspace, so that we are able to migrate all assets to the E2 environment in a simplified manner. Our strategy was to apply a straightforward 'lift and shift' approach. We decided to replicate the existing workspace in its entirety, maintaining all of its established characteristics, into the new E2 account.
In reality, as of today, our workloads are still running on this single bridge workspace (but on E2 architecture) as we currently do not see a real benefit in overcomplicating things:
However, one challenge we still face is having unified ACLs on schemas and tables, for which we might plug in Unity Catalog in the future.
Looking ahead, we might consider further subdividing the workspace among various teams and perhaps creating distinct workspaces for different environments such as production and staging. But for now, the goal was to migrate the unified legacy workspace as it is, preserving its existing structure and functionality.
Since our goal was a phased migration, some spark jobs could run in the legacy workspace while others could operate in the E2 workspace. This approach posed a significant challenge: ensuring all datasets remain synchronized across both environments at all times.
Everything, from all S3 buckets and their respective files to all Managed and External Tables of the Hive metastore, needed to be consistently available and updated in both the old and new environment, ensuring jobs could read the same data irrespective of their operational workspace.
Based on our requirements, here are some of the questions that we asked ourselves:
Leveraging an External Hive Metastore as a common Satellite
Using an External Hive Metastore (EHM) helped us manage metadata and table structures across different workspaces more effectively. Let’s go through a practical example to understand its benefits better.
Imagine adding a new partition to a table in Workspace A. Without an EHM, if you try to access this table from Workspace B, the new partition won’t be visible. This happens because, without a shared metastore, updates made in one workspace don’t show up in another.
One way to handle this without an EHM is to create a custom application to keep the internal metastores of the two workspaces in sync. But this method would need a lot of maintenance overhead and constant near real-time updates, taking resources away from other important tasks.
With an EHM, the metadata updates are centralized. So, when a new partition is added to a table in Workspace A, the EHM gets updated. Then, when you access the table from Workspace B, the EHM ensures that the new partition is visible, keeping data consistent across workspaces.
In conclusion, using an EHM made the data management process more efficient, eliminating the need for a separate application to sync metadata across different workspaces.
How was the migration to an External Hive metastore done?
We employed a 4-hour maintenance window to pause all activities on Databricks to be able to facilitate this migration. Although the process is fairly quick and did not take longer than 30 minutes.
Moving on to our next challenge after migrating to EHM, we were tasked with the migration of over 21,000 managed tables, which amounted to about ~10TB of data. Even though the External Hive Metastore (EHM) solved our problem of keeping tables and partitions in sync, serving as a central source of truth, the migration of the managed tables presented its own unique set of challenges.
Why did we need to migrate the Managed tables?
For the managed tables, the files were localized to the legacy workspace, and consequently the data, were inaccessible from the E2 workspace. Although the EHM provided metadata access, querying the managed tables in E2 was futile as they appeared empty; the actual files resided in the DBFS of the legacy workspace.
To elaborate with an example, if a managed table in the legacy workspace was queried from the E2 workspace, it would show up as empty with zero records because the actual files containing the data were located in the local DBFS of the legacy workspace, making them inaccessible from the E2 workspace.
Thus, finding an efficient strategy to handle the migration of these substantial managed tables was crucial in our migration process.
Given the substantial volume of data to be migrated - 21,000 tables equating to around 10 TB - a strategic, prioritized approach was essential. We identified tables that were accessed most frequently to migrate them as a priority.
To determine this, we utilized various pieces of table metadata, such as creation dates and partition addition dates, available in the metastore. Additionally, we implemented a Spark query listener, which further informed us about the usage patterns of the tables, ensuring that our migration strategy was both efficient and focused on the most immediately necessary data. More on Spark Query Listener if you keep reading!
How did we migrate the data?
The migration of data was meticulously carried out in multiple phases and iterations. Here's a systematic breakdown of how it was executed:
Consequently, this universal approach ensured that both the legacy and E2 workspaces pointed to the same synchronized location in S3, and also had access to the same data - promoting consistency and alignment across the two workspaces.
Let’s talk about the Spark Query listener a bit more.
We needed a solution that listens to all metadata I/O (files, partitions, tables) of our Spark jobs and stores the info permanently. This metadata could be joined with the external Hive metastore data and we would know which tables (which is only a small subset of all 21K managed tables) we have to move ourselves or convert to external tables to speed things up.
For the custom Spark Listener we extended the default QueryExecutionListener class and overwrote the queryExecutionListeners for each spark job via init script ("spark.sql.queryExecutionListeners" = "com.getyourguide.spark.listener.ClusterListener"). The application then listens to the Spark Query execution plan and parses relevant info as JSON. This JSON is then simply dumped to a MySQL database, so that we can do the querying and transformations later via SQL. A sample JSON payload looks like this:
As a next step, we just needed to dump the MySQL data to S3, so that we can analyze it easily in a notebook. It was much easier to write analytical queries using Spark SQL. We were interested in the inputs and outputs primarily, but the cluster and spark_conf data was very useful as well.
The final step is a simple join of the Spark listener input and output data with the metadata information stored in the Hive metastore database (to retrieve table_type and table location). That way each team could check their respective airflow tasks for managed tables that need to be converted into external tables.
For the final stage of the migration, we turned to Databricks Labs' migration tooling, which provided critical scripts for a one-time, clean switch from our standard workspace to the E2 workspace. These scripts were effectively utilized for migrating users, clusters, notebooks, jobs, secrets and ML models. This step was taken only after setting up the external hive metastore, and finishing up the Managed Tables migration, thus ensuring a smooth and final transition to the E2 environment.
When we decided to set up our Databricks and AWS infrastructure, Terraform immediately stood out as the right tool for the job. Here's why:
All relevant assets needed for a successful migration were created via the Databrick and AWS Terraform provider, amongst others:
The following architecture diagram taken from the official databricks provider docs illustrates our Terraform setup. In our case, it is slightly modified as we are using one single VPC (IAM role and databricks network) for all workspaces.
In short, Terraform has simplified our Databricks Platform journey and set us up for success with future cloud infrastructure projects.
We were able to migrate our whole Databricks infrastructure with 2FTEs within 6 months (start end of November 2022, roll out mid-May 2023).
As a wrap, these were our main learnings:
1. Basics of project management
2. Define a clear north star goal as well as milestones
3. Another learning for projects that exceed 1 month of work was to use milestones. With milestones you have something to present in between the project. They also give a good feeling how much you are aligned with keeping the timeline in terms of your initial planning. We defined 3 milestones, with a few subgoals were:
A big shout-out to Marketing, Data Products, Paid Search and all other teams involved for all the help. Your support and cooperation played a crucial role in the success of our migration.
How Sequential Testing Accelerates Our Experimentation Velocity