At GetYourGuide, we rely heavily on data to understand our customers and suppliers better, and we use Looker extensively for our business intelligence analysis. Recently, we migrated our Looker data source from Snowflake to Databricks to centralize our data warehouse and reduce operational costs. The migration cut costs by 20% and simplified our data infrastructure. Robert Bemmann and Houkun Zhu from our team share their insights and experiences from this significant transition.
In this article, we delve into the key reasons for the migration, the strategic approach adopted for the proof-of-concept (PoC), and the successful outcomes achieved.
GetYourGuide’s Data Environment
To understand our migration process, here’s a quick overview of the key components of our data environment:
Motivation
As the core data platform team, we ensure the platform remains healthy while carefully monitoring and managing operating costs. At the end of 2023, we revisited the toolset used for our presentation layer, which included Snowflake and Looker.
Our data lakehouse, with AWS S3 as the storage layer, is our source of truth and is fully managed via Databricks as the data processing tool.
At the time of the assessment, Databricks offered plenty of new features tailored to support BI workloads. Additionally, removing the extra step of copying data from Databricks over to Snowflake would simplify our data environment. Thus, we wanted to evaluate a Databricks-based solution as an alternative to our Snowflake-based setup.
Our primary KPIs were overall query performance (i.e., how long business users have to wait for reports to load in Looker) and the operating cost of the query engine.
Scope of migration
Our self-service business intelligence tool, Looker, contains our entire business reporting logic, expressed in its modeling language LookML, which generates the SQL queries that fetch data from the database connection. The following numbers provide a better understanding of the scope of our Looker deployment and the objects that needed to be changed as part of the migration:
The second big part was the actual migration of the underlying tables in our data warehouse. All Looker reports used 750 tables managed by an external Hive metastore, written as external tables in the Apache Parquet file format. However, since Delta Lake offers various data layout optimizations and unified data governance via Unity Catalog, we also decided to migrate the Parquet data assets to Delta tables registered as Unity Catalog securables.
Our North Star goals
Our most important constraint was to ensure zero or minimal disruption to our Looker deployment and its external integrations. From a performance perspective, the goals were that 90% of queries should execute in under 10 seconds and that more than 95% of queries should work without SQL syntax errors. The third constraint was that the overall costs for storage and compute should be lower than with the deployment before the migration.
Before fully committing to the migration, we ran an extensive PoC to validate our assumptions and fully understand the behavior and configuration of the new setup. The key objectives of the PoC included:
Let's break down these goals into individual steps.
Enabling Connectivity
The Databricks Serverless compute resources (used for Serverless SQL warehouses) run in a compute layer within your Databricks account and thus have their own IP range. The challenge for us was to make our VPC and the MySQL RDS database backing the external Hive metastore in our AWS account accessible to the Serverless compute resources. Connecting Looker and other tools to the SQL warehouse was easy to implement using a JDBC/ODBC connection.
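For quick connectivity checks from our own scripts (outside of Looker's JDBC connection), a minimal sketch like the one below can verify that the warehouse is reachable. It uses the Databricks SQL Connector for Python; the hostname, HTTP path, and token are placeholders, not real endpoints.

```python
# Minimal connectivity check against a Serverless SQL warehouse.
# Hostname, HTTP path, and token below are placeholders, not real endpoints.
from databricks import sql  # pip install databricks-sql-connector

with sql.connect(
    server_hostname="dbc-xxxxxxxx-xxxx.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1 AS connectivity_check")
        print(cursor.fetchone())
```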
Compare BI workload execution times
From a performance perspective, our goal was to meet or beat the query runtimes we previously achieved on Snowflake. To validate this, we compared runtimes for our 14 most frequently queried Looker dashboards by loading the necessary tables and making SQL syntax adjustments in a Looker development branch.
We codified the measurement process using the Looker SDK to automate and repeat the tests during the PoC phase. One very handy feature to highlight is the ability to run queries in Looker development mode: you authenticate with your personal Looker access token, and Looker figures out which development branch you are operating on. This allowed us to execute identical queries in both production and development mode.
For the performance comparison, we relied on runtime statistics from Looker rather than from the data warehouses directly, since end users care about Looker's execution times, including Looker's result caching and other optimizations. We repeated the runs five times and averaged the results, and of course we disabled the result cache for both JDBC connections.
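A stripped-down version of that measurement loop could look like the sketch below. The dashboard IDs, repetition count, and cache handling are illustrative rather than our exact script.

```python
# Rough sketch of timing dashboard queries via the Looker SDK.
# Dashboard IDs and the repetition count are illustrative.
import time
import looker_sdk
from looker_sdk import models40 as models

sdk = looker_sdk.init40()  # reads credentials from looker.ini / env vars

# Switch the API session to development mode so queries run against the
# personal dev branch (assumption: the branch already points at Databricks).
sdk.update_session(models.WriteApiSession(workspace_id="dev"))

DASHBOARD_IDS = ["123", "456"]  # placeholders for the most-used dashboards
RUNS = 5

for dashboard_id in DASHBOARD_IDS:
    dashboard = sdk.dashboard(dashboard_id=dashboard_id)
    for element in dashboard.dashboard_elements or []:
        if not element.query_id:
            continue
        timings = []
        for _ in range(RUNS):
            start = time.monotonic()
            # cache=False bypasses Looker's result cache for a fair comparison
            sdk.run_query(query_id=element.query_id,
                          result_format="json",
                          cache=False)
            timings.append(time.monotonic() - start)
        print(dashboard_id, element.title, sum(timings) / len(timings))
```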
Our findings were mixed: some dashboards were significantly faster, while others were noticeably slower. However, we learned that by investing engineering effort, such as SQL query optimizations or fine-tuning Delta table configurations (e.g., partitioning columns, Liquid Clustering, or adjusting Delta target file sizes), we could improve performance on the slower dashboards.
The conclusion was that the performance benefits outweighed the drawbacks of the slower dashboards. Our rationale was that percentage differences had to be weighed against absolute differences in seconds: a dashboard that takes 1.6 seconds longer to load (165% slower than its Snowflake equivalent) is far less impactful than a dashboard that loads 69.8 seconds (77%) faster than on Snowflake. Furthermore, we were confident we could optimize performance on Databricks even further.
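To give a flavor of those tuning levers, here is a hypothetical set of statements; the table name, column, and target file size are made up, and `cursor` is a SQL warehouse connection as in the earlier connectivity sketch.

```python
# Illustrative Delta layout tweaks of the kind mentioned above.
tuning_statements = [
    # Liquid Clustering on a frequently filtered column
    "ALTER TABLE analytics.fact_bookings CLUSTER BY (booking_date)",
    # Smaller target files can help selective BI queries
    "ALTER TABLE analytics.fact_bookings "
    "SET TBLPROPERTIES ('delta.targetFileSize' = '64mb')",
    # Rewrite existing files according to the new layout
    "OPTIMIZE analytics.fact_bookings",
]
for statement in tuning_statements:
    cursor.execute(statement)
```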
Find the sweet spot for a performant SQL warehouse configuration
For this task, the goal was to find the sweet spot between good performance and equal (or, ideally, lower) cost. Larger warehouses (determined by T-shirt size and number of clusters) deliver faster query performance, but they also increase costs. Our benchmark was the existing operational cost of our Snowflake warehouse.
To ensure consistent performance comparisons, we applied the same conditions to both connections.
Our variables for experimentation were:
To illustrate the variables, you can see two instances of the codified BI workload runs: the first run used an L-sized warehouse (with lower spikes in the maximum number of queued queries per 5-minute time block), and the second run used an M-sized warehouse.
Cost projection
Finding an optimal SQL warehouse configuration allowed us to make a direct cost comparison between the vendors. By dividing the measured Snowflake Credits by the measured Databricks DBUs, we calculated a multiplier that enabled us to project the cost of a Databricks-backed Looker deployment.
Here is how we approached it:
The following sample numbers illustrate the distribution of cost and Credits/DBUs in percentages.
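To make the arithmetic concrete, here is a minimal sketch of the projection. Every number below is invented purely to illustrate the calculation and does not reflect our actual consumption or prices.

```python
# Back-of-the-envelope cost projection with made-up numbers.
snowflake_credits_measured = 1_000.0   # Credits consumed by the PoC workload
databricks_dbus_measured = 3_500.0     # DBUs consumed by the same workload

# Multiplier from the PoC: measured Credits divided by measured DBUs
credits_per_dbu = snowflake_credits_measured / databricks_dbus_measured

# Project the current production workload onto Databricks
monthly_snowflake_credits = 20_000.0   # hypothetical monthly consumption
projected_monthly_dbus = monthly_snowflake_credits / credits_per_dbu

credit_price = 3.00   # USD per Snowflake Credit (contract-dependent)
dbu_price = 0.70      # USD per DBU for Serverless SQL (varies by contract/region)

current_cost = monthly_snowflake_credits * credit_price
projected_cost = projected_monthly_dbus * dbu_price
print(f"Projected monthly saving: {current_cost - projected_cost:,.0f} USD")
```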
With the final numbers, which turned out positive in our case (i.e., a savings potential), we were able to make an internal proposal for the full migration.
Learnings during the PoC
Migrating from Snowflake to Databricks required us to replace the existing Snowflake tables with Databricks Delta tables and to update systems and services to consume data from these new tables. As shown in the graph below, our table loading is managed via Apache Airflow, and the output tables are consumed by Looker and other systems. The migration therefore mainly involved three parts:
We will cover each step one by one in the subsections below.
In total, we had around 750 Snowflake tables to migrate to Delta tables, with some dependencies between them. For example, if table A depends on table B, table A should be loaded only after table B is loaded. To manage this complexity, we use a centralized configuration file to drive the loading process.
In this centralized configuration file, one can define the Airflow DAG and task_id as the data source, as well as the output Delta table information. Additionally, one can add extra parameters for specific loading tasks.
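A simplified sketch of what a single entry in such a configuration could look like is shown below; the keys, names, and parameters are illustrative, not our actual schema.

```python
# Hypothetical sketch of one entry in the centralized loading configuration.
TABLE_LOAD_CONFIG = {
    "fact_bookings": {
        # where the data comes from: the Airflow DAG/task that produces it
        "source_dag_id": "dwh_fact_bookings",
        "source_task_id": "build_fact_bookings",
        # where it goes: the output Delta table in Unity Catalog
        "target_catalog": "main",
        "target_schema": "analytics",
        "target_table": "fact_bookings",
        # optional extra parameters for specific loading tasks
        "load_params": {"mode": "incremental", "partition_column": "booking_date"},
        # simple dependency handling: load only after these tables are done
        "depends_on": ["dim_customers"],
    },
}
```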
The above configuration creates a corresponding load task in Airflow.
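Purely as an illustration (our actual operators and loading logic differ), tasks could be generated from a config like the sketch above roughly as follows:

```python
# Hypothetical generation of Airflow load tasks from TABLE_LOAD_CONFIG.
from airflow.decorators import dag, task
import pendulum

@dag(schedule="@daily", start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def delta_table_loads():
    def make_load_task(table_name: str, conf: dict):
        @task(task_id=f"load_{table_name}")
        def load():
            # In the real pipeline this would write the source data into the
            # target Delta table (conf["target_catalog"], etc.).
            print(f"Loading {table_name} with {conf['load_params']}")
        return load()

    tasks = {name: make_load_task(name, conf)
             for name, conf in TABLE_LOAD_CONFIG.items()}

    # Wire up dependencies declared in the config (table A after table B)
    for name, conf in TABLE_LOAD_CONFIG.items():
        for upstream in conf.get("depends_on", []):
            if upstream in tasks:
                tasks[upstream] >> tasks[name]

delta_table_loads()
```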
The centralized configuration not only helped us maintain the loading setup more efficiently but also eased monitoring. We set up a daily workflow to check whether the configured tables had been loaded correctly and contained all the data, giving us a guardrail during the migration to ensure tables were migrated smoothly and safely.
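A minimal sketch of such a guardrail check, reusing the hypothetical config and warehouse cursor from the earlier snippets (the freshness-column fallback is an assumption):

```python
# Sketch of a daily completeness check driven by the same configuration.
def check_table(cursor, conf: dict) -> bool:
    fqn = f'{conf["target_catalog"]}.{conf["target_schema"]}.{conf["target_table"]}'
    freshness_col = conf["load_params"].get("partition_column", "updated_at")
    cursor.execute(f"SELECT count(*), max({freshness_col}) FROM {fqn}")
    row_count, latest_value = cursor.fetchone()
    # The table must be non-empty and have a recent load to pass the guardrail
    return row_count > 0 and latest_value is not None

for name, conf in TABLE_LOAD_CONFIG.items():
    if not check_table(cursor, conf):
        raise RuntimeError(f"Guardrail failed for table {name}")
```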
Incremental roll-out plan
Once the data is ready in Delta tables, we need to switch the data sources in Looker from Snowflake to Databricks tables. Since Snowflake SQL and Databricks SQL have syntax differences, we had to adjust queries to ensure compatibility with Databricks SQL. Moreover, we needed to ensure query performance and data consistency after the migration, especially since our major business reports are built on the tables in scope of this migration.
Because Looker projects can import views from other projects, updates in one project can break others. Given the complexity and interconnected nature of our Looker projects, a full migration in one batch was not feasible. Instead, we followed an incremental rollout plan to ensure a smooth migration. Another benefit of this incremental approach is increased parallelization, allowing developers to work independently.
Specifically, we
Rollout dashboard
We built a Looker dashboard to track migration progress, as shown below. It presents the quantity and quality of migrated models and helps us ensure our success metrics are achieved.
SQL syntax changes
To bootstrap our syntax change process, we used the Looker API to spot syntax errors and identify the most common incompatibilities.
With all the issues identified, we developed automation scripts to apply the desired changes and fix these incompatibilities at scale, as sketched below.
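The sketch below illustrates the idea, assuming the Looker Python SDK's project validation endpoint and a hypothetical local checkout of the LookML project; the single substitution rule shown is just an example of a Snowflake-specific function, not our full rule set.

```python
# Sketch: surface LookML/SQL errors via the Looker API, then rewrite at scale.
import re
from pathlib import Path
import looker_sdk

sdk = looker_sdk.init40()

# 1) Spot errors: validate the project (in a dev-mode session so the
#    development branch is checked). Project name is an assumption.
validation = sdk.validate_project(project_id="getyourguide")
for error in validation.errors or []:
    print(error.file_path, error.message)

# 2) Fix common incompatibilities across the LookML files of a local checkout.
RULES = [
    # Example only: Snowflake's DIV0 has no direct Databricks equivalent
    (re.compile(r"\bDIV0\(", re.IGNORECASE), "try_divide("),
]
for path in Path("looker_project").rglob("*.view.lkml"):
    text = path.read_text()
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    path.write_text(text)
```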
Validating data correctness and query performance
With all the changes in Looker, we validated data correctness and query performance before rolling them out. Since the changes were made in a development branch, we had to compare the development branch against the production branch. This came with a few challenges: 1) we had too many queries and tables to check manually; 2) a single query execution takes a few minutes and can go up to tens of minutes; and 3) it is hard to compare results across the two environments, as Databricks and Snowflake return slightly different query results, e.g., different precision for the same tables with the same data and the same SQL statements.
To address these challenges, we took the following validation approach:
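As one building block of such a comparison, a tolerance-based check along the following lines can flag real mismatches while absorbing the precision differences mentioned above; the function name and tolerance are illustrative.

```python
# Illustrative tolerance-based comparison of query results between the
# production (Snowflake-backed) and development (Databricks-backed) branches.
import math

def rows_match(prod_rows, dev_rows, rel_tol=1e-6):
    """Compare result sets cell by cell, allowing tiny numeric differences.

    Both inputs should be sorted on a stable key so row-order differences
    between the two engines do not cause false mismatches.
    """
    if len(prod_rows) != len(dev_rows):
        return False
    for prod_row, dev_row in zip(prod_rows, dev_rows):
        for a, b in zip(prod_row, dev_row):
            if isinstance(a, float) and isinstance(b, float):
                if not math.isclose(a, b, rel_tol=rel_tol):
                    return False
            elif a != b:
                return False
    return True
```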
The Looker migration required the most work in this project due to its scale, complexity, and criticality. However, our incremental rollout plan, a live rollout dashboard, and automated migration and validation pipelines led to a smooth and successful Looker migration.