Automation is a buzzword these days, and automating a process opens up growth opportunities. App release monitoring (ARM) is no exception. ARM is a suite of tools designed to monitor the health and stability of iOS and Android app releases. These tools provide real-time updates by sending notifications to Slack channels and logging the app's status throughout the release process.
At GetYourGuide, we use ARM to monitor the rollout of our Android and iOS apps, starting from the moment they are submitted to Google Play or the Apple App Store. The system relies on app events, such as sessions and crashes, exported from Analytics and Crashlytics to BigQuery.
{{divider}}
Before we dive into the tools, we need to explain a few key concepts. We know this section might not be the most exciting, but stick with us: your patience will pay off!
Metrics play a key role in helping us evaluate the overall health, performance and stability of our app releases. ARM relies on two key metrics to monitor and ensure a smooth rollout during the release cycle.
The formula for the crash-free session rate (CFSR):

CFSR = (1 - crashed sessions / total sessions) × 100
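In code, the metric falls straight out of two raw counts. Here's a minimal sketch (the function name and wiring are ours, for illustration only):

```kotlin
// Crash-free session rate: the share of sessions that did not crash, as a percentage.
fun crashFreeSessionRate(crashedSessions: Long, totalSessions: Long): Double {
    require(totalSessions > 0) { "need at least one session" }
    return (1.0 - crashedSessions.toDouble() / totalSessions) * 100.0
}

fun main() {
    // 45 crashed sessions out of 5,000 -> 99.1% crash-free
    println(crashFreeSessionRate(45, 5_000))
}
```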
Our tools are also based on a set of service-level objectives (SLOs). An SLO acts as a key benchmark for app stability, performance, and availability: an agreed-upon target, defined by the tech team, that sets the minimum acceptable level for a metric. For example, the crash-free session rate, which is a stability SLO, needs to be at least 99.95%.
To measure whether we're meeting these objectives, we have the service-level indicator (SLI), a real-time measurement of a specific metric, such as the crash rate. The goal is for each SLI to stay at or above its SLO.
At GetYourGuide, we follow a company-wide standard that all SLOs must meet the 99.95% threshold for critical metrics like crash-free sessions and latency.
MoE (margin of error) is a statistical concept that measures the uncertainty of using sampled data to draw conclusions about the total population. The higher the MoE, the less reliable the sampled data is. It is used to build the confidence interval, a range that spans twice the MoE.

With the confidence interval, we can say that the true value lies between a lower bound (measured value - MoE) and an upper bound (measured value + MoE).
MoE is affected by three factors: the confidence level (via its z-score), the sample proportion, and the sample size.
The MoE formula for a proportion:

MoE = z × √(P × (1 - P) / N)

where z is the z-score of the chosen confidence level, P is the sample proportion, and N is the sample size.
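Here's that formula as a small Kotlin helper (a sketch; the default z-score assumes a 99% confidence level):

```kotlin
import kotlin.math.sqrt

// Margin of error for a sample proportion.
// p: sample proportion (e.g. 0.991 for a 99.1% crash-free session rate)
// n: sample size (e.g. number of sessions)
// z: z-score of the confidence level (2.576 for 99%)
fun marginOfError(p: Double, n: Long, z: Double = 2.576): Double =
    z * sqrt(p * (1 - p) / n)

fun main() {
    val moe = marginOfError(p = 0.991, n = 5_000)
    println("MoE ≈ %.4f".format(moe)) // ≈ 0.0034, i.e. about 0.34 percentage points
}
```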
We will discuss how we use MoE later on 🙂
Our app is built using Kotlin, a powerful and developer-friendly language that keeps it easy to maintain and scale. It defines different tasks, or jobs, all inheriting from a parent Job class and each assigned an ID.

When the app runs, all the jobs are instantiated and bound to named dependencies by Koin.

At runtime we have the different job instances and run the desired one using Gradle. Each job is scheduled and executed within a separate GitHub Action. The action builds and runs the app, specifying the job ID and providing the necessary secrets and variables.
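A rough sketch of that wiring (the class names and qualifiers here are hypothetical, not our actual code):

```kotlin
import org.koin.core.context.startKoin
import org.koin.core.qualifier.named
import org.koin.dsl.module

// Common contract every job inherits from.
abstract class Job(val id: String) {
    abstract fun run()
}

class ReleaseMonitoringJob : Job("release-monitoring") {
    override fun run() { /* monitor SLOs, halt or fast-track the release */ }
}

class DailyReportJob : Job("daily-report") {
    override fun run() { /* post the noon report to Slack */ }
}

// Each job is bound to a named dependency.
val jobsModule = module {
    single<Job>(named("release-monitoring")) { ReleaseMonitoringJob() }
    single<Job>(named("daily-report")) { DailyReportJob() }
}

// The GitHub Action runs the app via Gradle, passing the desired job id.
fun main(args: Array<String>) {
    val koin = startKoin { modules(jobsModule) }.koin
    koin.get<Job>(named(args.first())).run()
}
```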
We have three main jobs running via GitHub Actions, each with its own duties:
This tool continuously monitors the app's health status after submission to the stores until the release is completed or rolled out to all users. It halts a release if the crash-free sessions SLOs are unmet and fast-tracks a release if the SLI indicates the release is performing well. The cron job runs this tool every 15 minutes from Monday to Wednesday and every hour from Thursday to Sunday.
This job relies on several components.
The process starts by using the Mobile Stores component to fetch app releases, which are then passed to the Monitoring Rule Executor. The Rule Executor holds a list of monitoring rules, iterates over them, and applies each one to the app release based on the rule's properties.
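The shape of this in code is roughly the following (a sketch, with names of our own choosing):

```kotlin
// A release as fetched from the stores (simplified).
data class AppRelease(
    val platform: String,       // "android" or "ios"
    val version: String,
    val rolloutFraction: Double,
)

// Each monitoring rule independently decides what to do with a release.
interface MonitoringRule {
    val appliesTo: Set<String>
    fun evaluate(release: AppRelease)
}

class MonitoringRuleExecutor(private val rules: List<MonitoringRule>) {
    fun execute(releases: List<AppRelease>) {
        for (release in releases) {
            rules
                .filter { release.platform in it.appliesTo }
                .forEach { it.evaluate(release) }
        }
    }
}
```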
The release monitoring job is designed to be flexible, allowing new rules to be added easily for checking and validating releases. Currently, we have three rules in place:
The Crash-Free Sessions Halting Rule is a critical component of our app release monitoring system. It uses its subcomponents to fetch the crashes and sessions, calculates the CFSR, and makes a decision based on that. If the CFSR falls below the defined service-level objective (SLO), the release is automatically halted, or a warning is sent to our team via Slack.
It depends on several subcomponents:
The Crash-free use case fetches the number of crashes and sessions for the entire release cycle, covering the period from when the release is approved until the execution time of the query. Two subcomponents power the data collection:
It fetches crashes and sessions using the BigQuery Executor every 15 minutes, with a time range starting from the beginning of the day up to the current execution time, and stores the results in the Firestore database. By doing this, we can aggregate the previously stored data with the latest query results from each run. This is done for cost reduction: the cost of a BigQuery query depends on the volume of processed data, and since the job runs every 15 minutes, querying the whole release period from BigQuery on every run would drive costs up. With partial querying, we decrease the volume of data that needs to be scanned.
Crashes and Sessions are stored in different datasets.
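A simplified sketch of that incremental collection (the field names and client interfaces are assumptions for illustration):

```kotlin
import java.time.LocalDate
import java.time.LocalDateTime

// Counts returned by one BigQuery query over a time slice.
data class EventCounts(val crashes: Long, val sessions: Long)

// What we persist in Firestore after each run.
data class DailyAggregate(val day: LocalDate, val counts: EventCounts)

interface BigQueryExecutor {
    fun countCrashesAndSessions(from: LocalDateTime, to: LocalDateTime): EventCounts
}

interface FirestoreRepository {
    fun upsert(aggregate: DailyAggregate)
    fun aggregatesSince(day: LocalDate): List<DailyAggregate>
}

class CrashFreeDataCollector(
    private val bigQuery: BigQueryExecutor,
    private val store: FirestoreRepository,
) {
    // Query only today's slice (cheap) and overwrite today's aggregate;
    // earlier days were already persisted by previous runs.
    fun collect(now: LocalDateTime) {
        val counts = bigQuery.countCrashesAndSessions(
            from = now.toLocalDate().atStartOfDay(),
            to = now,
        )
        store.upsert(DailyAggregate(now.toLocalDate(), counts))
    }

    // Whole-release totals come from summing the stored daily aggregates.
    fun releaseTotals(releaseStart: LocalDate): EventCounts =
        store.aggregatesSince(releaseStart)
            .fold(EventCounts(0, 0)) { acc, day ->
                EventCounts(acc.crashes + day.counts.crashes, acc.sessions + day.counts.sessions)
            }
}
```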
The Release Halter component uses the MobileAppStore component to halt an in-progress release and then sends a Slack message to the main channel. The MobileAppStore uses the Google Play publisher and iOS App Store APIs to halt a release.
A halt may come after a warning, so we need to find the last message and send the Halt message as a reply.
Each time we halt a release, we save its information, such as the app version and the crash-free sessions SLO and SLI. This data is invaluable for debugging and for tracking how many hotfixes we ship per platform in a month or a year.
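Put together, the halting flow looks roughly like this (a sketch; the interfaces stand in for the real store, Slack, and Firestore clients):

```kotlin
data class HaltRecord(val platform: String, val version: String, val slo: Double, val sli: Double)

interface MobileAppStore { fun haltRelease(platform: String, version: String) }

interface SlackClient {
    fun findLastMessageTs(version: String): String?   // timestamp of the last warning, if any
    fun post(text: String, threadTs: String? = null)  // threadTs = null -> new top-level message
}

interface HaltRecordRepository { fun save(record: HaltRecord) }

class ReleaseHalter(
    private val appStore: MobileAppStore,
    private val slack: SlackClient,
    private val haltLog: HaltRecordRepository,
) {
    fun halt(platform: String, version: String, slo: Double, sli: Double) {
        appStore.haltRelease(platform, version)

        // A halt may follow a warning, so reply in the warning's thread if one exists.
        val warningTs = slack.findLastMessageTs(version)
        slack.post("Halting $platform $version: CFSR $sli is below the SLO of $slo", threadTs = warningTs)

        // Persist the halt for debugging and hotfix statistics.
        haltLog.save(HaltRecord(platform, version, slo, sli))
    }
}
```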
Note: There is a big difference between Android and iOS when it comes to halting a release. Halting an Android app removes the faulty version from the Google Play Store, making it inaccessible to users. On iOS, however, halting only prevents the auto-update of the problematic version, but it remains available for download from the App Store.

The Release Alert Monitor functions similarly to the halter but only sends a warning message without pausing the release.
The SLO Provider plays a critical role in determining whether a release should be halted or allowed to proceed based on real-time data. It continuously evaluates app stability by calculating the margin of error (MoE) based on the number of user sessions and crash data. This process helps determine whether the app release is stable enough to proceed or whether corrective action, such as halting the release, is required. Usually, we alert when the SLI reaches half of the halt SLO.
We use this rule to fast-track a release and increase the rollout to 100% of users. Like the previous rule, it runs every 15 minutes, but we only check for fast-tracking once in the release cycle, on the second day of the release.
We calculate the MoE using crash-free session data for the whole release cycle. The previous job already saves the entire release cycle's data, such as the crash-free sessions rate and the number of sessions, in Firestore, so we fetch it and use it to calculate the MoE. If we can fast-track, we send a Slack message like the following:
To calculate the MoE, we use the crash-free session rate (CFSR) as the sample proportion (P) and the number of sessions as the sample size (N). Imagine we have calculated a CFSR of 99.1% and have 5K sessions from the BigQuery result. Let's take a 99% confidence level, so the z-score is 2.576. We also rely on the fact that the CFSR is approximately normally distributed once the sample size is reasonably large.

In theory, when the sample size is small, crashes follow a Poisson distribution, since crash occurrences are independent of each other. However, in our case the sample size grows quickly, so we can use the normal distribution as the base distribution.
Now let's calculate the confidence interval:
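Plugging our example numbers into the MoE formula from earlier:

MoE = 2.576 × √(0.991 × (1 - 0.991) / 5000) ≈ 0.0034, or about 0.34 percentage points

Lower bound = 99.1 - 0.34 = 98.76%
Upper bound = 99.1 + 0.34 = 99.44%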
So, statistically, we want to infer with 99% confidence that the CFSR will stay above our SLO of 99.95% once the app is fully released.
As you can see in the following image, the confidence interval, a range between a minimum and a maximum value, might intersect the SLO or lie entirely above or below it.
In our example, the upper bound is lower than our SLO, so this release is not a candidate for fast-tracking.
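One way to encode that decision (a sketch; we assume here that fast-tracking requires the whole interval, i.e. the lower bound, to clear the SLO):

```kotlin
import kotlin.math.sqrt

// Decide fast-tracking from the whole-release CFSR and session count.
// Assumption for this sketch: the interval's lower bound must clear the SLO.
fun canFastTrack(cfsrPercent: Double, sessions: Long, sloPercent: Double = 99.95): Boolean {
    val p = cfsrPercent / 100
    val moePercent = 2.576 * sqrt(p * (1 - p) / sessions) * 100
    return cfsrPercent - moePercent > sloPercent
}

fun main() {
    println(canFastTrack(cfsrPercent = 99.1, sessions = 5_000))   // false: even the upper bound (99.44) is below the SLO
    println(canFastTrack(cfsrPercent = 99.99, sessions = 50_000)) // true: lower bound ≈ 99.98 clears the SLO
}
```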
Here is our approach to a release:
The fast-track approach rewards building a more stable app, with a CFSR safely above the SLO.
This rule doesn't act on the current release but only provides helpful information about its current status. It acts as a guardrail against false positives.
We might have crashes that happen only on a specific app version or a specific device and don't affect all users. This tends to happen in the Android ecosystem due to the variety of devices and customised OSs per brand. In such cases, there is usually a meaningful difference between the crash-free sessions and crash-free users metrics, which helps us detect the situation and possibly continue the release.
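A sketch of what that comparison could look like (the threshold and names are illustrative, not the actual rule):

```kotlin
// If sessions crash far more often than users are affected, the crashes are likely
// concentrated on a few devices or OS builds rather than spread across all users.
fun looksDeviceSpecific(
    crashFreeSessionsRate: Double, // e.g. 99.40 (%)
    crashFreeUsersRate: Double,    // e.g. 99.93 (%)
    gapThreshold: Double = 0.3,    // illustrative gap, in percentage points
): Boolean = (crashFreeUsersRate - crashFreeSessionsRate) > gapThreshold
```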
This tool provides a daily report at noon, detailing the crash-free session rate, current rollout percentage, and upcoming rollout percentage. It has a similar approach to the previous job but uses batch tables instead of real-time tables. This allows us to get a comprehensive overview of the release and assess how a fix or hotfix has impacted the overall health of the release.
The reports cover the data for both the last day and the whole release cycle. They also include the top three crashes of each release. This increases visibility throughout the engineering department, since everyone checks the channel and is notified of app releases. Most EMs and PMs also find these reports very handy.
This increases the rollout percentage by the defined values and reports the latest rollout percentage and the upcoming rollout update level in the channel. It only applies to Android since the iOS app’s rollout is increased automatically by the App Store, and we don't have control over it.
We publish the Android app in a staged rollout using Fastlane in GitHub Actions. We run the job once per day at noon.
We heavily use the Android publisher SDK in this job. We first do the following checks:
If these two conditions are met, we fetch the latest user fraction and increase it. The user fraction steps are typically 0.15, 0.75, and 1, so on the first day of the release we roll out to 15% of users, on the second day to 75%, and on the third day to 100%, completing the release.
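The stepping logic boils down to something like this (a sketch; in the real job the fraction is read and written through the Android Publisher SDK):

```kotlin
// Staged-rollout steps: 15% -> 75% -> 100%.
val rolloutSteps = listOf(0.15, 0.75, 1.0)

// The next user fraction given the current one; null once the rollout is complete.
fun nextUserFraction(current: Double): Double? =
    rolloutSteps.firstOrNull { it > current }

fun main() {
    println(nextUserFraction(0.15)) // 0.75 (day two)
    println(nextUserFraction(0.75)) // 1.0  (day three, completes the release)
    println(nextUserFraction(1.0))  // null -> nothing left to do
}
```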
If we find a hotfix, we apply the latest release's user fraction to it. This is very helpful for improving the state of the app release.
Why don't we have iOS rollout update automation? As mentioned above, the App Store increases the iOS rollout automatically, and we have no control over it.
App Release Monitoring (ARM) has improved the app release process at GetYourGuide, bringing significant benefits:
By implementing ARM, we have not only streamlined our app release process but also enhanced the overall quality and stability of our mobile applications. This automation has been a great learning and development experience for app engineers at GetYourGuide: it frees up time and opens a new scope in the app development domain for bringing innovative, forward-thinking ideas to the table. As mobile applications continue to be crucial in the travel industry, systems like ARM will be key to maintaining a competitive edge through high-quality, frequently updated apps.