The Double-Edged Sword of Data Science
Today we hear from Sam Crouch, Data Engineer, on the importance of building a simple and sustainable Machine Learning system.
When a Data Scientist locks onto a winning idea that bumps up revenue, the employers of said Data Scientist usually rejoice. Sometimes, however, all the bells and whistles of revenue distract from the code the revenue depends upon, where technical debt and complexity might be lurking.
We recently experienced this after successfully applying a machine learning (ML) model in the bidding system of GetYourGuide's targeted marketing system. Using the performance data of ads and search requests from Google, we built a model to predict the value of search terms on Google Search. Adding this functionality to the targeting system significantly increased revenue and even led to the creation of a new team around the prototype. And that's just it: this was a prototype running in production.
Then, when the next quarter rolled around, the prototype became the new baseline, and the calls for innovation and growth rang out again. In the new quarter we needed the ML system to be flexible to new ideas and resilient to issues and bugs. Therefore, the new team needed to examine the prototype and see how it could be molded for faster, long-term iteration.
The Ideal Machine Learning System
First off, the system should be as simple as possible. Machine learning systems don’t need to be all cutting-edge platforms and custom-coded models using the latest research. Sure, cutting-edge can be a goal, but even in ML, KISS should still apply. A simple ML system should have the following specifications:
- Use existing, well-tested frameworks
- Be simple and flexible in code
- Be simple to deploy
- Allow quick iteration on ideas
- Allow errors in production to be reversed quickly via a simple rollback mechanism
- Have effective unit tests*
*One note on unit testing: tests should be written only for the data-transformation components of the pipeline, before and after the use of an ML package like MLLib/scikit-learn/sciruby (I'm gunning for an unlikely rise in Ruby machine learning). If you're writing tests for your ML itself, you may have too much abstraction happening.
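As a minimal sketch of what "testing only the transformations" might look like, here is a hypothetical pure data-transformation step with plain-assertion tests. All names are illustrative, not our actual code, and plain `assert` calls stand in for a test framework such as ScalaTest:

```scala
// Hypothetical data transformations that would run before handing
// features to an ML library. The ML model itself is not tested here.
object FeatureTransforms {
  // Normalise a raw search-term string before feature extraction:
  // trim, lowercase, collapse any run of whitespace to one space.
  def normalizeTerm(raw: String): String =
    raw.trim.toLowerCase.replaceAll("\\s+", " ")

  // Min-max scale values into [0, 1]; map everything to 0.0 when all
  // values are equal, to avoid division by zero.
  def minMaxScale(values: Seq[Double]): Seq[Double] = {
    val (lo, hi) = (values.min, values.max)
    if (hi == lo) values.map(_ => 0.0)
    else values.map(v => (v - lo) / (hi - lo))
  }
}

// The tests cover only the transformation logic, not the model.
object FeatureTransformsTest extends App {
  assert(FeatureTransforms.normalizeTerm("  Walking  Tour\tBerlin ") == "walking tour berlin")
  assert(FeatureTransforms.minMaxScale(Seq(10.0, 20.0, 30.0)) == Seq(0.0, 0.5, 1.0))
  assert(FeatureTransforms.minMaxScale(Seq(5.0, 5.0)) == Seq(0.0, 0.0))
  println("all transformation tests passed")
}
```

Tests like these run locally in milliseconds, which is exactly what makes the fast iteration described below possible.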
The team needed to evaluate the prototype based on the previous points. It became immediately clear that the prototype met only one of these specifications: “using an existing ML framework”. In order to tick all the remaining boxes, we needed to understand the system’s limitations.
Limitations of the Machine Learning System
Developing and maintaining the prototype meant editing code effectively in production: connecting to production data and operating on it directly. Testing therefore consisted mostly of running the code on huge data sets and eyeballing the resulting data, which made testing error-prone and iteration between code changes slow. A corollary was that any suggestion from a code review took a non-trivial amount of time to verify. Consequently, stylistic suggestions and potential programmatic or resource optimisations were justifiably ignored.
Each code change led to a further compounding of complexity. This process resulted in a snowball effect, down a slippery slope, causing a chain reaction of butterfly effects. Well, not quite that severe, but you get the picture.
Furthermore, the deployment/rollback process of the prototype was not optimal, as the codebase didn't have built-in versioning. Without automatic versioning, we wouldn't have a known-good version to roll back to in case of emergency.
The prototype was also lacking unit tests. There are many discussions to be had about the value of unit tests, but I would wager that most participants in that argument would concede that a system responsible for such a significant share of the company's capital should have some level of programmatic assurance.
So, following recommendations from the Databricks documentation, we considered moving the code to a locally runnable, compilable Scala library. This solved the requirements in the following ways:
- We could continue to use an existing ML framework in MLLib
- Using a locally runnable library would allow the developers to write tests for critical components of the code to assure it worked as intended. Additionally, any change would be automatically checked by these tests, making future code changes faster and iteration time shorter.
- Once the code has tests, optimisations take significantly less time, since they are programmatically checked by the tests. Therefore, code suggestions from reviews can be easily implemented and verified. This, along with other code best practices, leads to simpler, more robust, and more flexible code. (Note: the topic of code best practices is too involved to detail here and will be covered in another article.)
- Once the code is simpler and more flexible, it becomes much easier to understand, change, and iterate on with new ideas.
- Once the code is in a library, it can easily be integrated into an automatically deployable workflow with minimal oversight from developers for each deployment.
- A versioned library can easily be swapped out for the previous library, so rolling back is simple.
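To illustrate the last two points, here is a hypothetical sbt build fragment (names and versions are made up) showing how an explicitly versioned library makes rollback a one-line change in the consuming job:

```scala
// build.sbt (illustrative only): the library is published with an
// explicit version, so every deployed artifact is identifiable.
name := "bidding-ml"
organization := "com.example"
version := "1.4.2"

// The production job then depends on a pinned version:
//   libraryDependencies += "com.example" %% "bidding-ml" % "1.4.2"
// Rolling back means redeploying the job with % "1.4.1" instead —
// no code changes, no re-testing of old logic.
```

This is the rollback mechanism in its simplest form: the previous artifact still exists in the repository, so reverting is just swapping a version string.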
Rolling into Production
So, we had decided on a goal, but now we needed to understand how to roll it out into production. The production system was responsible for a significant majority of the company's entire advertising budget, which meant we needed to make sure the rollout wouldn't cause any hiccups. We defined the steps of the current system and the interfaces between them. We then iteratively replaced each of these steps with the Scala library adhering to those interfaces. Over the next few months, we refactored each of the critical code chunks (notebooks) into classes in the library, along with effective unit tests for the important cases.
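The step-by-step replacement strategy above can be sketched as a common interface that both old and new implementations satisfy. This is a hypothetical illustration, not our actual code; all names are invented:

```scala
// A record type standing in for whatever flows between pipeline steps.
final case class SearchTermRecord(term: String, clicks: Long, cost: Double)

// The shared interface: every step, whether still a notebook wrapper
// or already migrated into the library, implements the same contract.
trait PipelineStep {
  def run(input: Seq[SearchTermRecord]): Seq[SearchTermRecord]
}

// A library-backed step that replaced a former notebook chunk:
// drop terms with no clicks so they don't skew value prediction.
class FilterInactiveTerms extends PipelineStep {
  def run(input: Seq[SearchTermRecord]): Seq[SearchTermRecord] =
    input.filter(_.clicks > 0)
}

object Pipeline {
  // Steps compose in order. Swapping one implementation for another
  // during the incremental migration changes nothing for the callers.
  def run(steps: Seq[PipelineStep], input: Seq[SearchTermRecord]): Seq[SearchTermRecord] =
    steps.foldLeft(input)((data, step) => step.run(data))
}
```

Because each replacement honoured the existing interface, every intermediate state of the migration was still a fully working production system.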
We now have a Scala library running our entire ML system. Because of the new system, we've been able to plug in new ideas (new models, new features, new paradigms, new assumptions) significantly faster and more easily, write tests when necessary, incorporate code suggestions, deploy fast, and roll back easily.
So now when a Data Scientist comes up with another winning idea, the developers are able to rejoice, as well as the employers.