In our last article, we talked about the challenges we faced and the key lessons learnt when launching a Generative AI (Gen-AI) solution at scale. In this post, we'll discuss a statistical method that was particularly useful for measuring the impact of our experiment. We'll explain the circumstances that led us to rely on this methodology and how powerful and widely applicable the technique is. Lastly, we'll summarize our learnings and talk about best practices for tackling tricky analytical challenges.
These insights should prove useful to readers who run into tricky analytical challenges, want to find solutions, and aim to go the extra mile to ensure their innovative work doesn't end up as a one-off endeavour.
A large percentage of A/B tests focus on measuring how well users convert when exposed to a treatment variant (commonly known as B) versus a control variant (commonly known as A). The conversion rate is typically measured as the proportion of users exposed to a variant who take a certain action. A Z-test of proportions is then applied to validate whether the observed difference between variants A and B is statistically significant.
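To make this standard setup concrete, here is a minimal sketch of a pooled two-proportion Z-test in Python; the function and the conversion counts are our own illustration, not numbers from our experiment.

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(conversions_a, n_a, conversions_b, n_b):
    """Pooled two-proportion Z-test for the difference in conversion rates."""
    p_a, p_b = conversions_a / n_a, conversions_b / n_b
    p_pool = (conversions_a + conversions_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))  # two-sided
    return z, p_value

# Hypothetical counts: 10,000 users per variant, 1,200 vs. 1,290 conversions.
z, p = two_proportion_ztest(1_200, 10_000, 1_290, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```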
However, this methodology is based on a set of assumptions that can easily be violated in real-life situations. Failure to meet these assumptions can invalidate any result we obtain.
One of the ways this can occur, which we also encountered when launching our Gen-AI feature, is when our randomization unit differs from our analysis unit. In that case, the "success" of each unit/user isn't a single outcome; it can vary across the observations we record for that unit. It also means that certain units (i.e., users) can have a much higher impact on the overall outcome of the experiment and can single-handedly sway results towards A or B.
Even evaluating the result of an experiment at the session level using click-through rates violates a very important assumption: multiple sessions can come from the same user, so our observations are no longer i.i.d. variables. This also results in an underestimation of variance, which in turn inflates the Z score and hence the likelihood of falsely rejecting the null hypothesis (you can read more about this phenomenon here). We thus need to explore alternative methods to analyze the impact of the experiment.
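To illustrate the variance problem, here is a small, self-contained simulation (the parameters are made up, not our data): it repeatedly runs A/A experiments in which sessions cluster within users and analyzes them with a naive session-level two-proportion Z-test, whose false rejection rate ends up noticeably above the nominal 5%.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def naive_session_level_false_positive_rate(n_users=2_000, n_sims=2_000, alpha=0.05):
    """Simulate A/A experiments with sessions clustered within users and
    analyze them with a naive session-level two-proportion Z-test."""
    rejections = 0
    for _ in range(n_sims):
        propensity = rng.beta(2, 8, size=n_users)      # each user's own click rate
        sessions = rng.poisson(5, size=n_users) + 1    # sessions per user
        variant = rng.integers(0, 2, size=n_users)     # user-level 50/50 split
        clicks = rng.binomial(sessions, propensity)    # clicked sessions per user
        # Naive analysis: treat every session as an independent observation.
        n_a, x_a = sessions[variant == 0].sum(), clicks[variant == 0].sum()
        n_b, x_b = sessions[variant == 1].sum(), clicks[variant == 1].sum()
        p_pool = (x_a + x_b) / (n_a + n_b)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
        z = (x_b / n_b - x_a / n_a) / se
        rejections += 2 * (1 - norm.cdf(abs(z))) < alpha
    return rejections / n_sims

# There is no true treatment effect, yet the rejection rate lands well above 5%.
print(naive_session_level_false_positive_rate())
```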
During the launch of our AI content creator experiment, we faced a similar issue. Our primary metric of interest was Activity Submission Conversion Rate, which measured how likely an activity provider was to complete submitting an activity after they started creating it. Similar to our session click-through rate example above, this was a case where our analysis unit did not match our randomization unit.
Additionally, we wanted to measure the success of our feature for travelers. One of our core hypotheses was that AI would assist activity providers in creating better-quality content, which would lead to improved performance for travelers as well. Measuring this was particularly tricky: we were randomizing assignments at the activity-provider level, but we wanted to measure the impact on travelers, too.
Because these activities were new, we couldn't compare their current performance with past results to see how much AI boosted performance. Setting up an experiment where travelers could see a non-AI-generated version of an activity versus an AI-generated version was also not possible. Nor could we split activity providers into variants in a way that guaranteed similar performance in terms of traveler metrics, because we had no guarantee that these providers would create new activities during the experiment, or that their new activities would resemble their previously created ones. In short, predicting an activity's performance as soon as it goes online is highly complex.
Thus, we had to navigate two tricky measurement problems: measuring Activity Submission Conversion Rate under activity-provider-level randomization, and measuring traveler performance for activities that had just gone live.
Permutation testing is a non-parametric statistical method used to determine the significance of an observed result, in our case the outcome of an experiment. Unlike a Z-test, which relies on multiple assumptions, a permutation test makes very few. It allows us to test hypotheses without assuming normality or homogeneity of variance, making it particularly useful when those assumptions may be violated. The table below shows which assumptions are required for each test.
A permutation test works by building a null distribution of the test statistic: we repeatedly shuffle (permute) the variant labels and recalculate the statistic on each permuted dataset. By comparing the observed test statistic to this distribution of permuted statistics, we can determine the probability of obtaining the observed result by chance alone. This article provides a very nice illustration of how it works.
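As an illustration, here is a minimal sketch of such a test in Python. The function, the toy data, and the choice of a difference in means as the test statistic are our own simplifications, not GetYourGuide's internal implementation.

```python
import numpy as np

def permutation_test(outcomes, labels, n_permutations=10_000, seed=42):
    """Two-sided permutation test for the difference in means between
    variant 'B' and variant 'A'."""
    rng = np.random.default_rng(seed)
    outcomes, labels = np.asarray(outcomes, dtype=float), np.asarray(labels)

    def diff(lab):
        return outcomes[lab == "B"].mean() - outcomes[lab == "A"].mean()

    observed = diff(labels)
    # Build the null distribution by shuffling the variant labels.
    permuted = np.array([diff(rng.permutation(labels)) for _ in range(n_permutations)])
    # p-value: how often a shuffled difference is at least as extreme as the observed one.
    p_value = np.mean(np.abs(permuted) >= np.abs(observed))
    return observed, p_value

# Toy example with made-up binary conversions per unit.
outcomes = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0]
labels = ["A"] * 6 + ["B"] * 6
print(permutation_test(outcomes, labels, n_permutations=5_000))
```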
This method also works very well for calculating traveler metrics in the AI content creator A/B test. It lets us evaluate whether the observed differences in our set of metrics are large enough to be attributable to the experimental manipulation rather than to random variation.
Some additional considerations apply when running a permutation test. In particular, the labels should be shuffled at the same level at which randomization was done in the original experiment. For example, if our test statistic involves a session-level metric but randomization was done at the user level, the labels should be shuffled at the user level (and not at the session level), as the sketch below shows.
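To make this concrete, here is a sketch of a permutation test for a session-level click-through rate with user-level randomization. The function and the column names (user_id, variant, clicked) are our own illustration, not the internal GetYourGuide tooling.

```python
import numpy as np
import pandas as pd

def user_level_permutation_test(sessions: pd.DataFrame, n_permutations=10_000, seed=42):
    """Permutation test for a session-level click-through rate where
    randomization happened at the user level: variant labels are shuffled
    across users, never across individual sessions.
    Expects columns 'user_id', 'variant' ('A'/'B') and 'clicked' (0/1)."""
    rng = np.random.default_rng(seed)
    # One label per randomization unit (user), not per session.
    users = sessions.groupby("user_id")["variant"].first().reset_index()

    def ctr_diff(labels):
        # Re-assign each session the (possibly shuffled) label of its user.
        assigned = sessions["user_id"].map(dict(zip(users["user_id"], labels)))
        rates = sessions.groupby(assigned)["clicked"].mean()
        return rates["B"] - rates["A"]

    observed = ctr_diff(users["variant"].to_numpy())
    permuted = np.array([
        ctr_diff(rng.permutation(users["variant"].to_numpy()))
        for _ in range(n_permutations)
    ])
    return observed, np.mean(np.abs(permuted) >= np.abs(observed))
```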
At GetYourGuide, in addition to our experimentation platform, we have an expansive internal Python library and Databricks notebooks that support analysts in a wide variety of analyses. After the conclusion of the experiment, we decided to expand our tooling to include permutation testing. This meant that anyone at GetYourGuide could now use these tools to both analyze and visualize the results of A/B experiments based on permutation tests. This not only allowed users to evaluate custom metrics for specific experiments, but also made it possible to compute pre-defined statistics for common use cases. For example, activity performance is a commonly used metric, and it can now be evaluated based only on a list of activity IDs and the variants they belong to.
This makes running permutation tests as easy as writing two lines of code. We also added detailed documentation on how the test can be run, what runtimes to expect (since specifying more iterations can significantly increase run times), and how to save computationally expensive results by simply specifying the name of a table in which they should be stored.
In addition to building a capable and easy-to-use tool, we also discussed the specific problem we faced and how we solved it using permutation tests in multiple meetings and forums in the analytics organization. This included department-wide learning sessions as well as wider forums like our Tech All Hands. Doing this garnered interest and pushed adoption.
Lastly, sharing concrete examples and success stories about how the new approach (and its accompanying tooling) could solve other challenges helped analysts understand its value and motivated them to use it.
In this blog post, we discussed the limitations of traditional Z-tests in analyzing A/B experiments, particularly when certain assumptions are violated. We highlighted a common situation where the randomization unit differs from the analysis unit and how this can lead to invalid results. We then introduced permutation testing as an alternative statistical method that requires fewer assumptions and is more robust in such scenarios, and described how we turned it into shared tooling and practices that the wider analytics organization could adopt.
We hope that these insights and learnings prove valuable to you in your own endeavors and help you to maximize the impact of your work.
Special thanks to Agus for being our sparring partner and Raslam for reviewing this article.
Interested in a pioneering career at the forefront of travel and tech? Check out our open roles.