Data Science
Mar 16, 2022

Open-sourcing Db-rocket for Data Scientists

Jean Carlo Machado
Data Science Manager

Jean Machado is the machine learning (ML) platform lead at GetYourGuide. He explains a small but effective tool we developed to speed up Spark-based model development for production machine learning on Databricks.

Among the core data manipulation tools we use at GetYourGuide are Databricks and its notebook solution. These tools let us be a very agile data organization.


However, when deploying machine learning models to production, notebooks aren’t the right tool. The best practices of going to production are based on Git projects with continuous integration and continuous deployment (CI/CD).

Once we have a proof-of-concept machine learning solution and want to take it to production, we migrate the notebook code into Python projects in Git. After migration, we still need to develop the projects further. Over time, our team explored different ways to transition from notebooks to projects developed in GitHub; the main solutions we explored are outlined below.

The most commonly used solution was to copy and paste snippets of the project back and forth between the repository and a Databricks notebook. This option is very easy to get started with, but it is manual and error-prone: if you forget to copy something, your code won't work, and you'll have to debug it every time you make a change.

For more mature projects, we had an improved process: build the GitHub project as a Python or Scala library and start a Databricks job to run it. The main drawback of this approach is that starting a job takes a long time; expect around 7 minutes, even for small projects. For development purposes, that's too long a wait to validate a code change.

Another problem is that inspecting job results is not great: you have to browse through the logs tab on Databricks to know what's going on. In an ideal development environment, you should see the results of a code change within seconds, directly on your screen, rather than having to navigate to another page.

Databricks provides many tools for improving the development experience. One is databricks-connect, which lets you run Spark queries from your local machine with cloud data accessible as if it were present locally. Databricks-connect is great for many use cases, such as running a PySpark query against your main cluster and getting fast results on your local machine to debug something, which is very common in analytics use cases.
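As a sketch of that workflow: once databricks-connect is installed and configured for a cluster (via databricks-connect configure), ordinary PySpark code run locally executes its queries on the remote cluster. The table and column names below are hypothetical:

```python
# Ordinary PySpark code; with databricks-connect configured, this session
# points at the remote Databricks cluster rather than a local Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "bookings" and "booking_date" are hypothetical names, for illustration only.
daily_counts = spark.table("bookings").groupBy("booking_date").count()
daily_counts.show()  # results are fetched back to the local machine
```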

But databricks-connect doesn't work so smoothly with our data science workflows. Data science projects at GetYourGuide run on their own clusters, with different Spark version requirements per project, and databricks-connect isn't easy to configure in that setup: with multiple clusters running multiple Spark versions, you have to install multiple versions of the library.

Second, the databricks-connect tooling isn't well integrated with the other data science tools (like PyCharm or Jupyter). For example, when running queries, the progress bar usually isn't displayed, so it's harder to see what's going on.

Third, with databricks-connect, we bring the data to our code rather than the code to our data, which we feel isn't the right architecture when dealing with big data. Because databricks-connect runs part of the code on the local machine and moves data back and forth between it and the cloud, we experienced frequent query failures and higher latencies than on a normal interactive cluster.

Given these findings, we decided to look for ways to improve the development experience.

We saw that Python libraries can be installed at the notebook level, so why not automate sending code under development to the cluster?

That’s why we created a small tool called db-rocket.

Db-rocket

Db-rocket’s job is very simple: it waits for you to change your Git project locally, and once something changes, it builds a Python library out of your project and makes it available on a Databricks File System (DBFS) path.
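Conceptually, the loop is: watch the project directory, rebuild a wheel whenever a file changes, and copy the wheel to DBFS. The snippet below is a minimal sketch of that idea, not db-rocket's actual implementation; it assumes the watchdog and build packages plus a configured Databricks CLI, and all paths are illustrative:

```python
# Minimal sketch of a watch-build-upload loop (not db-rocket's actual code).
# Assumes: pip install watchdog build, and a configured Databricks CLI.
import subprocess
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

PROJECT = "/home/me/my-ml-project"      # local Git checkout (hypothetical path)
DBFS_PATH = "dbfs:/tmp/my-ml-project/"  # where the wheel lands (hypothetical path)


class RebuildOnChange(FileSystemEventHandler):
    def on_modified(self, event):
        if not event.src_path.endswith(".py"):
            return  # ignore build artifacts and other noise
        # Build a wheel from the project...
        subprocess.run(["python", "-m", "build", "--wheel"], cwd=PROJECT, check=True)
        # ...and push it to DBFS so notebooks can pip-install it.
        subprocess.run(
            ["databricks", "fs", "cp", "--overwrite", "--recursive",
             PROJECT + "/dist/", DBFS_PATH],
            check=True,
        )


observer = Observer()
observer.schedule(RebuildOnChange(), PROJECT, recursive=True)
observer.start()
try:
    while True:
        time.sleep(1)  # keep watching until interrupted
finally:
    observer.stop()
    observer.join()
```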

db-rocket being used on the local development machine to send code to Databricks

Once there, you just need to run the installation on a notebook cell, and you’ve got the new version!

Databricks notebook using 2 different versions of the same library (db-rocket itself in this case)
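In practice, installing the freshly built wheel in the notebook is a single cell; something like the following, where the DBFS path and wheel name are illustrative:

```python
# Notebook cell: install the freshly uploaded wheel from DBFS.
# The path and wheel name below are illustrative.
%pip install /dbfs/tmp/my-ml-project/my_ml_project-0.1.0-py3-none-any.whl --force-reinstall
```

After that, any module of the library can be imported directly in the next cell, e.g. from my_ml_project import features (a hypothetical module name).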

With this approach, we no longer need to copy code from a repository to the notebook. We can import any part of the library and use it in the notebook. We can also change or add new parts directly in the code, or in the cell, depending on what we need at the moment.

When compared with running as a job, the feedback loop from changes to results takes around 15 seconds instead of the prior 7 minutes! Plus, the code now goes to the data, which makes the whole experience more stable and performant.

Another advantage is that db-rocket can leverage the refactoring tools of the integrated development environment (IDE) on a local machine, while using the code in a notebook. We don't need to do the heavy setup on the local machine for developing a project (such as mock data, secrets, or monitoring). Cloning a Github project locally and using db-rocket should be enough to get your code development ready.

The tool was a major productivity gain for our data scientists and is now used by multiple teams at GetYourGuide. We’ve also discovered surprising new uses for it: some teams now use db-rocket to serve experimental prototypes to other stakeholders. You can give someone preview access to your model, for instance, by installing the library you’ve been developing in a notebook cell of theirs.

Next steps

There are many things to improve in db-rocket. It would be great if there were a way to install a library in a notebook programmatically rather than having to run the installation cell again. Installing a new library version also has the side effect of resetting the notebook state, which we would like to avoid.

Finally, we’re still not totally happy with the ~15-second delay to build a library and make it available in the notebook. The wait is still too long to stay completely in flow while developing data science projects. Nevertheless, db-rocket clearly improved our productivity compared with previous methods.

In the future, Databricks will support seamless IDE integration, which is probably the most effective way to develop. In the meantime, db-rocket can save you some precious time.

If you’re interested in speeding up development of Python projects on Databricks, give db-rocket a try!

If you’re interested in joining our Engineering team, check out our open roles.
