Engineering
Jan 14, 2020

When Scaling Google Ads, Say No to State

Paul Baecher
Senior Backend Engineer

Paul Baecher, senior backend engineer, shares how we manage our enormous volume of Google Ads through carefully engineered Apache Spark jobs. He explains how he solved myriad issues while building our new ad-management pipeline, and why persistent mutable state is arguably the number one source of complexity in software.

{{Divider}}

Automating Google Ads

Search engine marketing, specifically Google Ads, is a significant marketing channel for our online marketplace. This channel grew so large early on that manually managing Google Ads quickly became infeasible.

Luckily, Google Ads offers a comprehensive API to manage accounts, so automation is just around the corner — if you're willing to invest some time into engineering. If you're a growing company, though, one Google Ads account won't be enough to hold all your marketing material, and you'll have to spread out over multiple accounts. This is where things get interesting.

Since the API is inherently based on an account level, it gets more challenging to answer questions that span many accounts, let alone make changes on that scope. Both latency and throughput will quickly spiral out of control when something takes hundreds of rate-limited API calls to complete.

The next logical step would typically be to set up an in-house database that mirrors the data on Google Ads. Queries across your entire performance-marketing operation can then be answered quickly, without any external API bottlenecks. However, what's easily grasped in the physical world can be a real headache in software engineering. One does not simply mirror a database. Having a local copy of Google Ads comes with its own set of challenges and subtleties. Some questions in need of an answer are:

  • How can we safely change the local database and Google Ads in one transaction?
  • What if someone manually makes changes on Google Ads, how will they be reflected in our local database?
  • How do we reconcile conflicts in general?

Stepping back for a moment, the underlying theme is that we are now trying to keep two data stores synchronized: heterogeneous stores with differing schemas, over a low-throughput, rate-limited channel. Oh, and we don't have full control over one of them. This problem is hard even when some of the significant constraints are removed (think of internal master-master replication in distributed database systems); in this setting, it is close to impossible to get right.

The inevitable result is that these two data stores will never be fully synchronized, not even close. For our marketing operation, that means analytical queries return incorrect results, we attempt changes on Google Ads that don't make sense, or we simply work with outdated information. To make things worse, experience has shown that the differences tend to grow steadily over time.


The Marketing-Tech team recently decided to take a different approach when it came to managing Google Ads and moved away from attempting to mirror everything bi-directionally. We'll have a look at one particular subsystem of our performance marketing pipeline and discover how to avoid most of the outlined pitfalls.

Enter the Apache Spark ads pipeline

One main ingredient for search-engine performance marketing is the actual advertisement fragments displayed on top of organic results. At GetYourGuide, we manage these fragments with our ads pipeline. Just like other parts of our marketing machinery, this system used to rely on locally-mirrored information. It kept around a substantial amount of persistent mutable state, exhibiting all of the issues above.

Since we handle quite a few ads, this system also makes critical use of Apache Spark; the entire business logic is expressed in Spark. This adds a few more challenges. Many Spark workloads start their life as a hastily thrown-together ad-hoc notebook (Databricks, Jupyter, and so on). When, not if, they are put into production, standard engineering practices like testing and general code hygiene are often an afterthought.

When we set out to redesign this part of our system, we wanted to address both of these issues: the new implementation should avoid persistent mutable state where possible, lend itself to testability, and provide high robustness.

Engineering a Spark workload

Let's talk about state first. Persistent mutable state, especially when shared across multiple systems, is arguably the number one source of complexity in software engineering. Except in the most trivial cases, state is unavoidable. However, the payoff is enormous if we can negate some of the adjectives in front of it.

Ephemeral state is better than persistent state. Immutable state is better than mutable state. Private state is better than shared state. Bonus points for combining those.

For our new ads pipeline, the inputs are daily reports from Google Ads. These reports are essentially complete data dumps and can be considered immutable for any given day in the past. The pipeline itself is stateless: all decisions are based on those input reports and some command-line arguments.

There is no immediate state dependency from one run to the next. The pipeline is idempotent: it can be re-run for any day in the past and will produce the same output. In a sense, it is very close to a physical pipeline; there are no hidden reservoirs in it. It turns out that good design at the micro level, namely pure functions, is also good design at the systems level.
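A run of such a stateless pipeline can be sketched as follows. This is a minimal illustration, not the actual implementation: names like load_report and build_ads are hypothetical stand-ins, and the real pipeline is a Spark job over full report dumps.

```python
from datetime import date

def load_report(day: date) -> list[dict]:
    # Stand-in for reading the immutable daily Google Ads report.
    # For a given day in the past, this always returns the same data.
    return [
        {"keyword": "berlin tours", "clicks": 120},
        {"keyword": "paris tickets", "clicks": 80},
    ]

def build_ads(report: list[dict], min_clicks: int) -> list[str]:
    # Pure transformation: the output depends only on the arguments.
    return [row["keyword"] for row in report if row["clicks"] >= min_clicks]

def run_pipeline(day: date, min_clicks: int) -> list[str]:
    # No state is carried over between runs: everything a run needs
    # comes from the day's report and the command-line-style arguments.
    return build_ads(load_report(day), min_clicks)

first = run_pipeline(date(2020, 1, 1), min_clicks=100)
second = run_pipeline(date(2020, 1, 1), min_clicks=100)
assert first == second  # re-running for the same day yields the same output
```

Because nothing flows into a run except its explicit inputs, "replaying" any past day is just another function call.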

Testing Spark jobs

Speaking of pure functions brings us directly to our second design goal: testability and robustness. A pure function is a function in the mathematical sense: it always returns the same value for a given argument, and it has no side effects such as performing input/output. These properties make testing very easy and very fast.
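To make the contrast concrete, here is a small sketch (with hypothetical names; nothing here is from the actual pipeline) of a pure function next to an impure variant that leans on hidden mutable state:

```python
# Pure: the result depends only on the arguments, and nothing else changes.
def adjusted_bid(base_bid: float, conversion_rate: float) -> float:
    return round(base_bid * (1.0 + conversion_rate), 2)

# Impure: the result depends on hidden mutable state, so the same call
# can return different values, which makes testing far harder.
_multiplier = 1.0

def adjusted_bid_impure(base_bid: float) -> float:
    global _multiplier
    _multiplier += 0.1  # side effect: mutates shared state on every call
    return round(base_bid * _multiplier, 2)

assert adjusted_bid(2.0, 0.5) == 3.0  # always holds
assert adjusted_bid(2.0, 0.5) == 3.0  # repeatable, no setup or teardown
```

A test for the pure version needs no fixtures and no ordering; a test for the impure version has to control the hidden multiplier first.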

They also make it very easy to reason about the code, since the entire behavior can be seen and understood in isolation. Unfortunately, at first glance, purity seems to be entirely at odds with a framework like Apache Spark. Spark distributes large computational workloads across a cluster of many machines over the network, and much of the run time is typically spent on input/output as data gets shuffled around, both between and within machines of the cluster. In a way, that is the exact opposite of what is desirable.

Given this situation, it's not surprising that Spark workloads are rarely well tested. In fact, testability is usually not a priority in the design phase, if there is a design phase at all. To make it a priority, we ensured that our new ads pipeline would be at least conceptually pure. All input/output is pushed to the very edges of the pipeline: reading input reports, writing output data, metric collection and submission, and so on.

These functionalities are injected as dependencies at the very start, and we provide test versions that don't perform any actual disk input/output during testing. A neat side effect, pun intended, is that this speeds up tests considerably, especially when combined with the spark-fast-tests package. Coincidentally, that package's documentation recommends exactly what we are doing.

More concretely, we have a storage interface with methods like loadSomeTable() and storeSomeData(), whose production implementations do nothing more than call spark.table() and dataframe.write...(), respectively. This is the idea of keeping input/output at the edges: everything between reading and writing is a pure data transformation. None of this is new, of course, but it is not necessarily on the radar when dealing with Spark.
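The shape of that storage interface can be sketched in plain Python. This is a hedged illustration of the pattern, not our Scala/Spark code: Storage, InMemoryStorage, and the snake_case method names are hypothetical stand-ins for the loadSomeTable()/storeSomeData() pair described above.

```python
from abc import ABC, abstractmethod

class Storage(ABC):
    """The I/O boundary: everything between load and store stays pure."""

    @abstractmethod
    def load_some_table(self) -> list[dict]: ...

    @abstractmethod
    def store_some_data(self, rows: list[dict]) -> None: ...

class ProductionStorage(Storage):
    # In the real pipeline this would wrap spark.table() and
    # dataframe.write...(); elided here since it needs a cluster.
    def load_some_table(self) -> list[dict]:
        raise NotImplementedError("wraps spark.table() in production")

    def store_some_data(self, rows: list[dict]) -> None:
        raise NotImplementedError("wraps dataframe.write...() in production")

class InMemoryStorage(Storage):
    # Test double: no disk or network I/O, so tests stay fast.
    def __init__(self, table: list[dict]):
        self.table = table
        self.written: list[dict] = []

    def load_some_table(self) -> list[dict]:
        return self.table

    def store_some_data(self, rows: list[dict]) -> None:
        self.written.extend(rows)

def run(storage: Storage) -> None:
    # The storage dependency is injected up front; the step in between
    # is a pure transformation of the loaded rows.
    rows = storage.load_some_table()
    active = [r for r in rows if r["status"] == "ENABLED"]
    storage.store_some_data(active)

fake = InMemoryStorage([{"id": 1, "status": "ENABLED"},
                        {"id": 2, "status": "PAUSED"}])
run(fake)
assert fake.written == [{"id": 1, "status": "ENABLED"}]
```

Swapping ProductionStorage for InMemoryStorage is the only change between a production run and a test run; the business logic never knows the difference.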

Experience so far and conclusions

Since we fully rolled it out, the ads pipeline has been solid. We have had virtually no failed runs in six months, and we were quickly able to set up a second copy of it for an unrelated migration, plus a third copy for manual testing with real data. Its logic is fully tested, and a complete test run takes less than three minutes. We attribute most of this success to testability and statelessness.

Of course, it is not always possible to make systems entirely stateless, and even when it is, hard trade-offs against business requirements may have to be made. Either way, the further you can get from persistent, mutable, and shared state, the better. Next time you design a system, spend a bit more time contemplating whether there is an opportunity to say 'no' to state!
