In this series, Giovanni Corsetti and Shaurya Sood, Data Engineers on the Core Data Platform team, delve into the integration of dbt with Databricks. This first part outlines a step-by-step guide to running dbt jobs on Databricks job clusters, a method the team initially explored to optimize data pipelines. While the approach was ultimately replaced, it remains a cost-effective and resource-efficient solution worth sharing.
{{divider}}
At GetYourGuide we rely on a proprietary tool called Rivulus for SQL transformations in Databricks job clusters. However, proprietary tools often come with challenges like maintenance complexity and steeper onboarding curves. That’s where dbt (data build tool) stood out. With its strong community support, built-in features like unit tests, pre- and post-hooks, and community-maintained packages, dbt offers significant advantages over Rivulus.
A common issue for those using Databricks with dbt is the lack of support for job clusters, which can push users towards more expensive alternatives. As mentioned on the official dbt website:
When you run a dbt (databricks) project, it generates SQL, which can run on All Purpose Clusters or SQL warehouses
The only support that Databricks provides in that regard is the ability to submit dbt tasks to job clusters; however, the job cluster is only used to trigger a dbt CLI command that calls a SQL warehouse under the hood.
Only a SQL warehouse or all-purpose compute can be used as the target for a dbt task. You cannot use job compute as a target for dbt.
In May 2024, Leo Schick proposed a workaround to run dbt on Databricks job clusters. I highly recommend reading it since most of his struggles and findings align with my experience trying to solve the same problem. His method utilized custom scripts and Databricks notebooks to trigger dbt-spark jobs, bypassing the dbt-databricks adapter.
Although this solution is functional, it did not meet several of the requirements we were looking for at GetYourGuide.
At GetYourGuide, we developed a solution leveraging Docker, encapsulating the dbt project along with the dbt-spark adapter’s profiles file. This approach involved creating a Dockerfile that bundled the necessary configurations and then pushing the resulting Docker image to a private container registry. Once in the registry, the image was deployed using the Databricks job run API, enabling the seamless execution of dbt tasks within job clusters.
This solution avoids using notebooks and aligns with other similar workflows running at GetYourGuide. Although we did not adopt this approach due to the complexity of local development, it remains a cost-effective solution with many benefits. The implementation process is described in more detail in the next sections.
The following figure illustrates this process.
Running dbt-databricks directly on job clusters is not supported at the time of writing. However, using dbt-spark is a viable alternative, since Databricks operates on Spark under the hood. To enable this setup, the dbt project is packaged into a Docker image, together with a dbt-spark profiles.yml file that points to the current Spark session.
Here’s how to achieve this:
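The key piece is the dbt-spark `session` connection method, which attaches dbt to the Spark session of the Python process it runs in; on a job cluster, that is the cluster’s own session. A minimal profiles.yml sketch, where the profile name, schema, and thread count are placeholders:

```yaml
# .dbt/profiles.yml -- illustrative sketch; profile name, schema and threads are placeholders
dbt_on_job_cluster:
  target: job_cluster
  outputs:
    job_cluster:
      type: spark
      # "session" attaches dbt-spark to the SparkSession of the current
      # Python process, i.e. the Spark session of the job cluster itself
      method: session
      schema: analytics
      host: NA   # not used by the session method, but required by the profile schema
      threads: 4
```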
Now, with the .dbt/profiles.yml file in place, two additional things are required:
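For illustration, two pieces such a setup typically needs are a Dockerfile that bundles dbt-spark, the project, and the profiles file into an image, and a thin entrypoint that invokes dbt from inside the job cluster so the session method can attach to the cluster’s Spark session. A minimal sketch, in which the base image tag, paths, and file names are assumptions:

```dockerfile
# Dockerfile -- illustrative sketch; the base image tag and paths are assumptions.
# The base image must be compatible with Databricks Container Services and
# should match the runtime version of the job cluster.
FROM databricksruntime/standard:14.3-LTS

# Install the dbt-spark adapter with the session extra
RUN /databricks/python3/bin/pip install "dbt-spark[session]"

# Bundle the dbt project (including .dbt/profiles.yml) into the image
COPY . /dbt_project
```

```python
# run_dbt.py -- illustrative entrypoint; paths are assumptions. Depending on the
# setup it can be baked into the image or stored where Databricks jobs can
# reference it (e.g. a workspace or DBFS path).
import sys

from dbt.cli.main import dbtRunner  # programmatic invocation, dbt-core >= 1.5


def main() -> None:
    # Forward any extra arguments passed by the job, e.g. "--select my_model"
    extra_args = sys.argv[1:]
    result = dbtRunner().invoke(
        ["run", "--project-dir", "/dbt_project", "--profiles-dir", "/dbt_project/.dbt"]
        + extra_args
    )
    if not result.success:
        raise SystemExit(1)


if __name__ == "__main__":
    main()
```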
With all the ingredients now on the table, they still need to be cooked to make the most of them.
Deploying a dbt project to Databricks job clusters requires a well-structured CI/CD pipeline. The figure below illustrates a possible CI/CD pipeline for a dbt GitHub repository that makes use of job clusters:
To create the Docker image, the branch name is used as a tag suffix. Since every push essentially generates a new image, this can lead to an excessive number of images in the container registry. To mitigate this, it is recommended to add a policy that cleans up CI-related images at regular intervals.
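As an illustration, a CI job that builds the image and tags it with the branch name could look like the following GitHub Actions sketch; the registry URL, image name, and secret names are assumptions:

```yaml
# .github/workflows/build-dbt-image.yml -- illustrative sketch; registry URL,
# image name, and secret names are assumptions.
name: build-dbt-image
on: push

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Log in to the private container registry
        uses: docker/login-action@v3
        with:
          registry: registry.example.com
          username: ${{ secrets.REGISTRY_USER }}
          password: ${{ secrets.REGISTRY_PASSWORD }}

      - name: Build and push, using the branch name as the tag suffix
        uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          # Branch names containing "/" may need sanitizing before use as a tag
          tags: registry.example.com/dbt-project:ci-${{ github.ref_name }}
```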
Once the Docker image is stored in the container registry, it can be used to trigger a dbt command in CI or in Airflow that will run a dbt transformation inside a job cluster. To facilitate this, a custom Python script, trigger_databricks.py, was developed.
The Python script requires five key arguments.
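As an illustration of what such a script might look like, the sketch below submits a one-time run via the Databricks job run API against a job cluster that boots from the custom Docker image; the argument names, node type, runtime version, and entrypoint path are assumptions rather than the original implementation:

```python
# trigger_databricks.py -- illustrative sketch, not the original script.
# Submits a one-time Databricks job run on a job cluster that boots from a
# custom Docker image and executes the dbt entrypoint.
import argparse

import requests


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--host", required=True, help="Databricks workspace URL")
    parser.add_argument("--token", required=True, help="Databricks API token")
    parser.add_argument("--image", required=True, help="Docker image URL in the private registry")
    parser.add_argument("--entrypoint", default="dbfs:/scripts/run_dbt.py",
                        help="Entrypoint path resolvable by Databricks (DBFS or workspace path)")
    parser.add_argument("--dbt-args", nargs="*", default=[],
                        help="Extra arguments forwarded to dbt, e.g. --select my_model")
    args = parser.parse_args()

    payload = {
        "run_name": "dbt-on-job-cluster",
        "tasks": [
            {
                "task_key": "run_dbt",
                "new_cluster": {
                    "spark_version": "14.3.x-scala2.12",  # assumption: a DCS-compatible runtime
                    "node_type_id": "i3.xlarge",          # assumption: any valid node type
                    "num_workers": 1,
                    # A private registry may additionally require "basic_auth" credentials here
                    "docker_image": {"url": args.image},
                },
                "spark_python_task": {
                    "python_file": args.entrypoint,
                    "parameters": args.dbt_args,
                },
            }
        ],
    }

    response = requests.post(
        f"{args.host}/api/2.1/jobs/runs/submit",
        headers={"Authorization": f"Bearer {args.token}"},
        json=payload,
        timeout=60,
    )
    response.raise_for_status()
    print("Submitted run:", response.json()["run_id"])


if __name__ == "__main__":
    main()
```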
Although this solution was effective, at GetYourGuide we ultimately shifted to using a serverless warehouse. The decision was driven by the smoother developer experience it offers.
The integration between dbt and Databricks is still not optimal for users aiming to use job clusters. However, for teams prioritizing cost savings and resource control, the steps outlined in this guide provide a practical and efficient approach. By leveraging Docker images and the Databricks job run API, it is possible to overcome these limitations and harness the power of dbt within a Databricks environment.