In order to improve the deployment experience of the developers at GetYourGuide and to enable them to configure more sophisticated deployment strategies, the Developer Enablement team migrated their deployment tooling from Spinnaker to ArgoCD. Engineering Manager Harshal Shah discusses the lessons learned during the migration.
At GetYourGuide, our services are written in Java, PHP, Python, and Vue.js, and run in an Istio service mesh. We used Spinnaker to deploy these services to our production Kubernetes cluster. After running Spinnaker in production for more than two years, we saw the following challenges:
With these problems at hand, we looked for alternatives. Our initial search led us to two options, FluxCD and ArgoCD. After a hands-on evaluation, we chose ArgoCD, which has wider community usage and seemed more aligned with our needs.
As we progressed through planning, preparation, and finally migration of our services to ArgoCD, we learned lessons which could be beneficial for potential new users. Our learnings are summarized below:
Installing ArgoCD on our test environment and managing test deployments with it helped our users become familiar with the ArgoCD UI. As a result, most users did not find the UI overwhelming when their services were migrated in production.
We migrated our own services manually and used that opportunity to learn how to make the migration easy for our users. We were able to ensure that services that do not use progressive delivery (canary/smoke-tests) could be migrated with a one-line pull request. This allowed us to migrate around 300 services in two weeks.
We chose to migrate two of our most complex services as early adopters. This gave us two benefits:
We chose to migrate all our remaining services to Argo in just two weeks. Beforehand, we ran multiple onboarding sessions and Q&As to answer user questions and walk through day-to-day workflows. All members of our team also paired with specific teams to migrate services that needed progressive delivery. This swift migration helped reduce confusion about which service was running on which platform.
In the course of our migration we came across certain shortcomings in upstream ArgoCD and Rollouts. We opened upstream issues and 35+ pull requests, which the community promptly answered, reviewed, and accepted. This helped us move forward a lot faster. Some important contributions include:
For applications with dozens of managed resources, the tree view would take a long time to render and then hang. For these applications, a paginated list was a much better default view. We contributed a change that lets an application choose its default view through annotations.
We also had some hiccups along the way...
While migrating our namespace management automation from Spinnaker to ArgoCD, we ended up deleting some very important application namespaces, which we were able to recover in an hour. We then implemented a validating webhook that rejects namespace deletion unless the namespace carries the explicit label namespaces.gyg.io/protection=allow-delete.
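The decision logic of such a webhook can be sketched in a few lines. This is a minimal illustration, not our actual implementation: the function name `review_namespace_deletion` and the wording of the rejection message are assumptions made for the example, and the HTTPS server and the ValidatingWebhookConfiguration that would route DELETE requests to it are left out. The only details taken from the text above are the label key and value. The sketch relies on the Kubernetes behaviour that, for DELETE operations, the object being removed arrives in `request.oldObject`.

```python
# Label taken from the article; everything else here is illustrative.
PROTECTION_LABEL = "namespaces.gyg.io/protection"
ALLOW_DELETE_VALUE = "allow-delete"


def review_namespace_deletion(admission_review: dict) -> dict:
    """Build an AdmissionReview response for an incoming AdmissionReview request.

    Namespace DELETE requests are rejected unless the namespace carries the
    label namespaces.gyg.io/protection=allow-delete. All other requests pass.
    """
    request = admission_review["request"]
    allowed = True
    message = ""

    is_namespace = request.get("kind", {}).get("kind") == "Namespace"
    if request.get("operation") == "DELETE" and is_namespace:
        # On DELETE, "object" is null and the existing object is in "oldObject".
        namespace = request.get("oldObject") or {}
        labels = namespace.get("metadata", {}).get("labels") or {}
        if labels.get(PROTECTION_LABEL) != ALLOW_DELETE_VALUE:
            allowed = False
            message = (
                f"namespace deletion is blocked unless it is labelled "
                f"{PROTECTION_LABEL}={ALLOW_DELETE_VALUE}"
            )

    response = {"uid": request["uid"], "allowed": allowed}
    if not allowed:
        response["status"] = {"code": 403, "message": message}

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

With a check like this in place, deleting a namespace becomes a deliberate two-step action: someone first has to add the label, and only then does the delete request pass admission.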
With multiple services using Argo Rollouts for progressive delivery, many AnalysisRuns were running in our cluster at the same time, all querying Datadog. Datadog rate limited these API calls, and we had to request a higher rate limit to mitigate this.
As a long-term solution, we made an upstream contribution to use the Datadog v2 API, which should reduce the chances of being rate limited.
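To see why concurrent analysis runs add up quickly, a back-of-the-envelope estimate helps. The sketch below assumes that every metric measurement results in exactly one Datadog query; the function name and all figures are hypothetical and are not our real numbers or Datadog's actual limits.

```python
def metric_queries_per_hour(
    concurrent_analysis_runs: int,
    metrics_per_run: int,
    interval_seconds: float,
) -> float:
    """Estimate the Datadog queries per hour issued by concurrent analysis runs.

    Assumes one query per metric per measurement interval (a simplification).
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    measurements_per_hour = 3600 / interval_seconds
    return concurrent_analysis_runs * metrics_per_run * measurements_per_hour


# Hypothetical figures for illustration only.
estimate = metric_queries_per_hour(
    concurrent_analysis_runs=20,
    metrics_per_run=3,
    interval_seconds=60,
)
print(f"{estimate:.0f} queries per hour")
```

The estimate grows linearly with the number of services deploying at the same time, which is why a limit that was comfortable for a handful of rollouts can be exhausted once progressive delivery is used widely.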
We could not find an existing solution for SLOs on ArgoCD and Rollouts, so we have created our own. We have started a conversation with the community and shall contribute it upstream once it is approved.
These are early days, but so far we are happy with the migration to ArgoCD.
With these benefits, we look forward to enabling additional progressive deployment capabilities for our users.