Edoardo Albergo is a Data Engineer in our Marketing and Marketplace Intelligence team. Here, he outlines the rationale, approach, and challenges behind transforming existing notebooks into a new pipeline supporting our brand marketing activities.
Our talented Brand Analytics team can deliver a perfectly functioning solution to collect the data that will feed their models. However, when it comes to productionizing it, it’s the job of data engineers like me to deliver the most efficient solution for them to work with.
Last quarter I took care of reorganizing our brand architecture, which is the ETL that processes and delivers data related to our brand marketing efforts. Specifically, I took ownership of multiple notebooks that were delivering the training dataset and consolidated them into a single pipeline.
This has been a great opportunity to build a complete architecture that covers one entire domain of our company. It gave me the chance to develop a strategy that will come in handy in future projects.
When a working solution is already in place, the value provided by data engineering is more than just delivering the data; it should also bring significant improvements to the process that delivers this data. Stating these principles at the beginning of the project sets the direction of the job and guides the implementation.
As always when we are about to build a pipeline, the first question that should come to our mind is “What kind of use cases will it serve?” This question is the one that drives all the actions that will follow. It sets our mindset to building something functional and effective, making a clear distinction between nice-to-haves and actual requirements.
Delivering data exactly when it is needed allows our stakeholders to save time and prioritize their work, but that doesn’t necessarily mean that our transformations should be executed as frequently as possible. Timeliness is about finding the best compromise between immediate availability and a rational use of resources.
Clearly no one writes solutions with the goal of providing incorrect information. Still, when building a pipeline that will execute the same code over and over, there is no guarantee that what delivers the right data today will keep doing so forever. One of the biggest improvements we can make when turning notebooks into a proper architecture is implementing automatic validations. This way we will immediately detect inconsistencies, thereby improving the overall reliability.
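To make the idea concrete, here is a minimal sketch of what such an automatic validation might look like, assuming the pipeline produces a pandas DataFrame; the column names and checks are hypothetical and would follow the real dataset’s contract.

```python
import pandas as pd

def validate_brand_dataset(df: pd.DataFrame) -> None:
    """Fail fast if the training dataset drifts from its expected shape."""
    # Hypothetical expectations; the real checks depend on the dataset's contract.
    assert not df.empty, "Dataset is empty"
    assert df["date"].is_monotonic_increasing, "Dates are out of order"
    assert df["spend"].ge(0).all(), "Negative spend values detected"
    assert not df.duplicated(subset=["date", "channel"]).any(), "Duplicate rows"

df = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    "channel": ["search", "search"],
    "spend": [100.0, 250.0],
})
validate_brand_dataset(df)  # passes silently; raises AssertionError on bad data
```

Even a handful of hand-rolled assertions like these, run on every execution, will surface a broken or deprecated source long before a stakeholder does.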
Chances are, if the solution works well, there may be a need to expand it in the future. Spending some time thinking about scalability will give the architecture a long life. Here is where solid knowledge of data engineering can make the difference between a temporary solution and a lasting infrastructure.
That comes with ownership: once it’s yours, you maintain it. Maintainability doesn’t impact the stakeholder directly, but it’s an essential pillar to comply with the timeliness and accuracy principles. It makes it easier for other engineers to contribute and for the system’s owner to sleep at night. Additionally, it improves the efficiency of the whole team, which will have more time to support stakeholders’ needs.
There are some specific challenges linked to turning notebooks into a scalable pipeline. Most of them come from transferring the ownership from the developer(s) of the code to the data engineer who needs to put the blocks together.
The main challenge is becoming confident with something we didn’t write in the first place, and understanding the logic behind every step. The longer the pipeline, the harder it is to get a comprehensive idea of the entire process. It is essential to fully understand the code’s context and impact before making any modifications, to ensure that the final result’s integrity is maintained.
There’s no one single way to build things and we may not be familiar with the style of the person who developed the code. Also, there may be dependencies between one transformation and other parts of the system that are not immediately apparent, making it difficult to understand the impact of changes to the code.
If we’re lucky, the code will be organized in one single notebook, structured into sequential steps. If we’re not lucky, we may have to deal with multiple transformations that are stored in different notebooks that get executed at different times and may have different owners.
Choose the correct design
One of the biggest challenges is to decide how our pipeline will look. Notebooks, by their own nature, are often written as a series of ad-hoc, one-off scripts that are difficult to integrate into a larger system. It could be tempting to speed up the implementation phase by preserving the notebooks’ original structure, but this comes at a very high cost in terms of maintainability, scalability, and overall ownership.
To each challenge, its strategy. These four steps are a gradual approach to the challenges we just stated.
Preliminary phase: First reading of the code
This very much relates to the Confusion challenge. Gaining a high-level understanding of the code is an essential step in building an architecture. It could be useful to break the code into modules that provide intermediate results and document each of them in plain English. This could very well be done directly in the code, to ease future reading. It’s important to resist the impulse to optimize that join, remove that redundancy, or apply that trick that could really improve an intermediate step. There may be dependencies between the code and other parts of the system that are not immediately apparent, making it difficult to understand the impact of changes on the final result.
Ideally, at the end of this step we should have a few-line description of what each query takes as input and returns as output. This will give us a high-level understanding of how each piece of code contributes to the final result.
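These input/output descriptions can live directly in the code as docstrings. A sketch of one such documented module, with a hypothetical schema:

```python
import pandas as pd

def daily_spend_per_campaign(events: pd.DataFrame) -> pd.DataFrame:
    """Input:  raw marketing events, one row per event, with columns
              'date', 'campaign_id', and 'cost' (hypothetical schema).
    Output: daily spend per campaign, one row per (date, campaign_id).
    """
    return events.groupby(["date", "campaign_id"], as_index=False)["cost"].sum()
```

Keeping the contract next to the transformation means the next reader gets the high-level picture without re-deriving it from the query itself.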
Understand your system: Sources Mapping
This one will take time, but we already know that conceptual maps are a data engineer’s best friend. Drawing a map of how the sources flow through each transformation, and of the hierarchy between them, is particularly useful when the pipeline is spread across different notebooks, as it helps us get a better sense of how the transformations relate to each other.
At the end of this step, we will have an intuition of which steps have similar scope and can be aggregated, and which others should be split to increase their readability and performance.
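The sources map can also be captured in code, which makes the execution order derivable instead of implicit. A sketch using Python’s standard library, with hypothetical table names:

```python
from graphlib import TopologicalSorter

# Each transformation lists the sources it reads from (names are hypothetical).
sources = {
    "raw_spend": [],
    "raw_impressions": [],
    "clean_spend": ["raw_spend"],
    "clean_impressions": ["raw_impressions"],
    "training_dataset": ["clean_spend", "clean_impressions"],
}

# static_order() yields a valid execution order: every source before its consumers.
order = list(TopologicalSorter(sources).static_order())
print(order)  # raw tables first, "training_dataset" last
```

A map in this form doubles as documentation and as a cheap sanity check: a circular dependency between notebooks would surface immediately as a `CycleError`.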
Hands on code: Redesign the pipeline
It’s finally the moment to turn the intuition gained from the previous step into a new design for the pipeline. This process is going to be iterative; most likely we’ll get new insights from working directly on the code, and the draft will be updated accordingly.
This is also the right time to improve the performance and the scalability of the transformations. In my case, I had two pipelines delivering similar results for two distinct geographies. The solution was to break them down into smaller steps and group those that could be scaled across multiple geographies, while separating those that are specific to a single location. Now scaling most of the transformations is just a matter of adding a line of code.
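The geography split described above could look roughly like this; the schema and the transformation step are hypothetical, but the pattern is the point: shared steps are parametrized by geography, and adding a market is one more entry in a list.

```python
import pandas as pd

GEOGRAPHIES = ["DE", "US"]  # scaling to a new market is one more entry here

def transform_for_geo(raw: pd.DataFrame, geo: str) -> pd.DataFrame:
    """A shared step, parametrized by geography (hypothetical schema)."""
    subset = raw[raw["country"] == geo]
    return subset.groupby("campaign_id", as_index=False)["spend"].sum()

def run_all(raw: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Run the shared steps once per geography."""
    return {geo: transform_for_geo(raw, geo) for geo in GEOGRAPHIES}
```

Location-specific logic stays in its own functions, outside the shared loop, so one geography’s quirks never leak into the others.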
At the end of this step, our new pipeline will be ready for testing and to be sent to production.
Finalization: Validate your pipeline and choose the execution frequency
Once the new pipeline is ready and tested, we’ll have to find the right balance between constant monitoring and actual data usage, a balance that depends largely on the use case this architecture serves.
The model fed by our brand architecture runs once per quarter, so a daily cadence would definitely be a waste of resources. Still, if we only ran the pipeline ad hoc, when the data is needed, the timeliness of the architecture would be at risk: it wouldn’t be possible to keep an eye on the problems that affect every ETL, such as incorrect or deprecated sources, outdated logic, or memory shortages.
Choosing an intermediate solution, running our pipeline on a bi-weekly cadence, enabled us to find a good compromise between constant monitoring and a reasonable use of resources. This way, we are able to fix issues as soon as they occur, and our data is ready with no surprises.
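Most schedulers express such a cadence natively (for example, as a 14-day interval); the underlying arithmetic is just counting days from an anchor run. A minimal sketch, with a hypothetical anchor date:

```python
from datetime import date

ANCHOR = date(2023, 1, 2)  # hypothetical date of the first scheduled run

def is_run_day(today: date) -> bool:
    """True every 14 days from the anchor: a bi-weekly cadence."""
    return (today - ANCHOR).days % 14 == 0

print(is_run_day(date(2023, 1, 16)))  # 14 days later -> True
print(is_run_day(date(2023, 1, 9)))   # 7 days later  -> False
```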
To summarize, there are clear advantages to turning a series of notebooks into a scalable pipeline, though it’s a project that requires a structured approach.
Stating the project’s principles helps the developer set guidelines for their work, and understand what kind of improvement to strive for. Gradually approaching the code and spending a significant amount of time understanding the pipeline is useful to overcome that sense of confusion that comes naturally when dealing with someone else’s solutions.
We are now about to scale our architecture for the second time, to provide the Brand Analytics team with timely and reliable data that will help them to predict the results of our marketing efforts.