Today we hear from Sam Crouch, Data Engineer, on the importance of building a simple and sustainable Machine Learning system.
When a Data Scientist locks onto a winning idea that bumps up revenue, their employers usually rejoice. Sometimes, however, all the bells and whistles of revenue distract from the code that revenue depends on, where technical debt and complexity might be lurking.
We recently experienced this after successfully applying a machine learning (ML) model in the bidding component of GetYourGuide's targeted marketing system. Using the performance data of ads and search requests from Google, we built a model to predict the value of search terms on Google search. Adding this functionality to the targeting system significantly increased revenue and even led to the creation of a new team around the prototype. But that's just it: it was a prototype, actually running in production.
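To make the idea concrete, here is a minimal sketch of this kind of model using Spark MLlib. The data path, the column names (clicks, impressions, conversions, revenue), and the choice of a plain linear regression are all assumptions for illustration, not the team's actual implementation:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression

object SearchTermValueModel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("search-term-value")
      .master("local[*]") // sketch only; a real job would run on a cluster
      .getOrCreate()

    // Hypothetical ad/search performance data, one row per search term.
    val performance = spark.read.parquet("/path/to/search_term_performance")

    // Collect the numeric performance signals into a single feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("clicks", "impressions", "conversions"))
      .setOutputCol("features")
    val training = assembler.transform(performance)

    // Regress observed revenue on those signals to estimate a term's value.
    val model = new LinearRegression()
      .setLabelCol("revenue")
      .setFeaturesCol("features")
      .fit(training)

    model.transform(training).select("search_term", "prediction").show()
  }
}
```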
Then, when the next quarter rolled around, the prototype became the new baseline, and the calls for innovation and growth rang out again. In the new quarter we needed the ML system to be flexible to new ideas and resilient to issues and bugs. Therefore, the new team needed to examine the prototype and see how it could be molded for faster, long-term iteration.
First off, the system should be as simple as possible. Machine learning systems don't need to be all cutting-edge platforms and custom-coded models using the latest research. Sure, cutting-edge can be a goal, but even in ML, KISS should still apply. A simple ML system should have the following specifications:

- It uses an existing ML framework rather than custom-coded models.
- It can be run and tested locally, allowing fast iteration on code changes.
- Its data transformations are covered by unit tests.*
- Its deployments are versioned, so there is always a version to roll back to.
*One note on unit testing: tests should only be written for the data transformation components of the pipeline, before and after the use of an ML package like MLlib/scikit-learn/SciRuby (I'm gunning for an unlikely rise in Ruby machine learning). If you're writing tests for your ML itself, maybe you have too much abstraction happening.
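To illustrate, here is a hedged sketch of such a test in ScalaTest, covering a hypothetical withCtr transformation that runs before the model; the helper and the column names are invented for the example:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical transformation under test: derives a click-through-rate column.
object FeatureTransforms {
  def withCtr(df: DataFrame): DataFrame =
    df.withColumn("ctr", col("clicks") / col("impressions"))
}

class FeatureTransformsSpec extends AnyFunSuite {
  private val spark = SparkSession.builder()
    .master("local[1]")
    .appName("feature-transforms-test")
    .getOrCreate()
  import spark.implicits._

  test("withCtr divides clicks by impressions") {
    val input = Seq(("berlin tours", 5L, 100L))
      .toDF("search_term", "clicks", "impressions")
    val result = FeatureTransforms.withCtr(input)
      .select("ctr").as[Double].collect()
    assert(result.sameElements(Array(0.05)))
  }
}
```

The test runs entirely locally against a tiny in-memory DataFrame, which is exactly the fast feedback loop the prototype lacked.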
The team evaluated the prototype against these points. It became immediately clear that the prototype met only one of the specifications: "using an existing ML framework". To tick all the remaining boxes, we needed to understand the system's limitations.
The development and maintenance process for the prototype consisted of writing and editing the code effectively in production, connecting to, and then operating on, production data. Testing therefore mostly meant running the code on huge data sets and eyeballing the resultant data, which made testing error-prone and iteration between code changes slow. A corollary was that any suggestion from a code review took a non-trivial amount of time to verify. Consequently, stylistic suggestions and potential programmatic or resource optimizations were justifiably ignored.
Each code change led to a further compounding of complexity. This process resulted in a snowball effect, down a slippery slope, causing a chain reaction of butterfly effects. Well, not quite that severe, but you get the picture.
Furthermore, the deployment and rollback process of the prototype was not optimal, as the codebase had no built-in versioning. Without versioning, there was no known-good version to roll back to in an emergency.
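For illustration, assuming an sbt build (all names and version numbers here are hypothetical), versioning can be as simple as cutting every release as an immutable artifact:

```scala
// build.sbt (sketch; names and versions are hypothetical)
name := "bidding-ml"
organization := "com.getyourguide"
version := "1.4.2" // bumped on every release, e.g. by a release plugin

scalaVersion := "2.12.18"

libraryDependencies ++= Seq(
  // Spark is provided by the cluster at runtime, so it isn't bundled.
  "org.apache.spark" %% "spark-sql"   % "3.3.2" % "provided",
  "org.apache.spark" %% "spark-mllib" % "3.3.2" % "provided"
)
```

Rolling back then means redeploying the previous artifact rather than trying to reconstruct an older notebook state.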
The prototype was also lacking unit tests. There are many discussions to be had about the value of unit tests, but I would wager that most participants in that argument would concede that a system responsible for such a significant share of the company's capital should have some level of programmatic assurance.
So, following recommendations from the Databricks documentation, we considered moving the code to a locally-runnable, compilable Scala library. This solved the requirements in the following ways:

- Running and testing locally against small data samples shortens iteration times and makes code-review suggestions cheap to verify.
- Compiling the library into a versioned artifact gives every deployment a version to roll back to.
- Structuring the transformations as classes and functions makes them unit-testable.
- The library still builds on the existing ML framework, so nothing needs to be custom-coded.
So, we had decided on a goal, but now we needed to understand how to roll it out into production. The production system was responsible for the majority of the company's advertising budget, which meant we had to make sure this rollout wouldn't cause any hiccups. We defined the steps of the current system and the interfaces between them. We then iteratively replaced each of these steps with the Scala library, adhering to those interfaces (a sketch of this approach follows below). Over the course of the next few months, we refactored each of the critical code chunks (notebooks) into various classes in the library, along with the necessary and effective unit tests for the important cases.
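Here is a minimal sketch of that step-and-interface approach; the trait and class names are hypothetical, but the idea is that each pipeline step hides behind a small interface so notebook chunks and library classes stay interchangeable during the migration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Each stage of the existing pipeline sits behind a small interface,
// so notebook chunks can be swapped for library classes one at a time.
trait PipelineStep {
  def run(input: DataFrame): DataFrame
}

// Hypothetical library replacement for one refactored notebook chunk.
class FeatureEngineeringStep extends PipelineStep {
  override def run(input: DataFrame): DataFrame =
    input.withColumn("ctr", col("clicks") / col("impressions"))
}

object Pipeline {
  // Runs whichever mix of old and new steps is currently in production.
  def run(spark: SparkSession, source: String, steps: Seq[PipelineStep]): DataFrame =
    steps.foldLeft(spark.read.parquet(source))((df, step) => step.run(df))
}
```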
We now have a Scala library running our entire ML system. Because of the new system, we've been able to plug in new ideas (new models, new features, new paradigms, new assumptions) significantly faster and more easily, write tests when necessary, incorporate code suggestions, deploy fast, and roll back easily.
So now, when a Data Scientist comes up with another winning idea, the developers can rejoice along with the employers.