Mihail Douhaniaris, Data Scientist at GetYourGuide, shares the key upgrades in DB-Rocket 2.0 that significantly enhance Databricks notebook workflows. Discover the practical solutions to common challenges like manual installations and notebook state preservation.
In a previous blog post, we introduced DB-rocket to our community, a productivity tool we developed at GetYourGuide to streamline notebook-driven data science workflows. The tool bridges the gap between developing and committing code locally to GitHub repositories, and testing and validating code changes interactively in Databricks notebooks.
It works by automatically building a package out of a Python project and uploading the built wheels to DBFS as soon as local changes are made to the code. This enables reinstalling the project package in a Databricks notebook, allowing end users to execute the project's source code directly within the notebook environment while having full access to the Databricks data warehouse.
This lets data scientists work on their code locally and validate changes in Databricks notebooks with minimal friction when switching between the two environments.
The initial release was met with great enthusiasm and our data scientists quickly adopted the tool for their projects thanks to the efficiency and productivity gains. However, over time a few limitations of this approach became apparent. Today, we're excited to talk about how we addressed these pain points and unveil the improvements in DB-Rocket 2.0!
Based on the feedback we received from our users, as well as our own experiences with DB-rocket, we pinpointed three key areas for improvement:

- Slow installations, dominated by reinstalling project dependencies rather than the project's own source code
- Loss of the notebook state (variables and data) on every reinstallation
- The need to manually rerun installation and import commands after each local change
As these pain points became clear, we pitched the idea of improving our Databricks notebook-driven development during our bi-monthly company focus days, and assembled a small task force within our data products organization to work on solutions over two days.
We noticed that the majority of the installation time was spent on project dependencies rather than the project's source code itself, which was particularly evident in large projects with many dependencies. The issue became even more apparent when we recently switched to Poetry for managing the dependencies of our Python projects: when packaging a project with Poetry, all dependencies are automatically declared in the built wheel, so installing it also pulls in every dependency.
It was evident that we would need to separate the dependency and project installation steps from each other. Ideally, we would need to install the dependencies only once when the notebook is first initialized and cache them in the notebook state. Thereafter, as new changes are made to the local source code we would only reinstall the project package that includes the latest changes from the user’s local environment. We knew that for this part we would need to move away from using a binary Python wheel, as that not only causes the notebook state to reset, but also requires the user to manually execute the installation command in the notebook.
To resolve these issues, we found that installing the project package in editable mode using pip's `--editable` flag worked much better for our use case. Instead of building a binary Python wheel, this method runs the project's source code directly from the raw files. As such, we can synchronize the user's local files to DBFS and use them directly in any Databricks process.
Finally, to avoid having data scientists manually rerun installation or import commands after each local change, we prepend the Databricks notebooks with a short preamble of Jupyter magic commands. This makes changes to the editable project files in DBFS instantly usable in the notebook environment, without running any installation commands or re-importing project modules. For extra convenience, DB-rocket handles all of this logic automatically and outputs the preamble installation lines to be used in the Databricks notebook!
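The post doesn't reproduce the exact preamble, but a typical autoreload-based preamble for an IPython-backed notebook looks like the cell below. The DBFS path is illustrative, not DB-rocket's actual output; `%load_ext autoreload` and `%autoreload 2` are standard IPython magics that re-import changed modules before each cell runs.

```
%pip install --editable /dbfs/tmp/<your-user>/<your-project>
%load_ext autoreload
%autoreload 2
```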
Using DB-rocket 2.0: modify project code locally and watch the changes seamlessly sync to your notebook, with no need to reinstall the project or re-import modules. The notebook state is also maintained, so you keep all your variables and data.
With this approach, the notebook state only needs to be reset when the notebook is first initialized or a new project dependency is added. Otherwise, under normal development iterations, local code changes are immediately reflected in the Databricks notebook state.
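The mechanism behind this instant reflection can be sketched in plain Python: when the source file behind an imported module changes, reloading the module refreshes it in place without touching the rest of the interpreter state. This is what IPython's autoreload extension automates before each cell. The module name and paths below are illustrative, not DB-rocket's actual layout.

```python
import importlib
import pathlib
import sys
import tempfile

# Skip bytecode caching so a reload always re-reads the source file.
sys.dont_write_bytecode = True

# A throwaway module standing in for the DBFS-synced project files.
src_dir = pathlib.Path(tempfile.mkdtemp())
module_file = src_dir / "demo_mod.py"
module_file.write_text("VALUE = 1\n")
sys.path.insert(0, str(src_dir))

import demo_mod
print(demo_mod.VALUE)  # -> 1

# Simulate a local edit arriving via file synchronization...
module_file.write_text("VALUE = 2\n")

# ...which autoreload would pick up automatically before the next cell runs.
importlib.reload(demo_mod)
print(demo_mod.VALUE)  # -> 2
```

Note that only the module object is refreshed; everything else in the session (variables, loaded data) survives, which is exactly why the notebook state no longer needs to be reset.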
Overall, we are very happy with the new changes included in DB-rocket 2.0. One of the biggest wins is the much faster speed at which local changes can be applied and tested in a Databricks notebook. Previously, a data scientist might have spent an average of 5 minutes waiting for changes to be reinstalled in the notebook and rerunning computations due to the notebook state reset.
With the new improvements, this waiting time is essentially eliminated in most scenarios. If we consider a data scientist making 10 changes a day, the time saved is approximately 50 minutes daily. Over a month, that adds up to a whopping 25 hours saved, equivalent to more than three full working days!
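The back-of-the-envelope arithmetic behind these figures, using the article's own assumptions (5 minutes per change, 10 changes a day, a 30-day month) rather than measured data:

```python
# Assumed inputs from the article, not measurements.
minutes_per_change = 5   # average wait per reinstall in DB-Rocket 1.0
changes_per_day = 10     # assumed iteration rate per data scientist
days_per_month = 30

daily_saving_minutes = minutes_per_change * changes_per_day
monthly_saving_hours = daily_saving_minutes * days_per_month / 60

print(daily_saving_minutes)  # -> 50
print(monthly_saving_hours)  # -> 25.0
```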
As for what’s next, we remain committed to developing and maintaining DB-rocket and improving our data science productivity. As with the changes introduced in DB-rocket 2.0, we recognize that it’s often the small, incremental changes that make a big difference over the course of a project. That’s why our next steps include handling edge cases and implementing smaller yet impactful improvements, such as refining the file synchronization process of DB-rocket 2.0.
If you’re interested in trying out our new version and elevating your Databricks notebook development experience, please have a look at db-rocket!
This iteration of DB-rocket wouldn't have been possible without the support, collaboration and feedback of our team members. Special thanks to Steven Mi, Giampaolo Casolla, Olivia Houghton, Hsin-Ting Hsieh, Meghana Satish, and the team for driving these issues and enabling us to continuously improve productivity!
If you’re interested in joining our Engineering team, check out our open roles.