Apr 27, 2023

Lessons Learned from Migrating to ArgoCD

Harshal Shah
Engineering Manager

In order to improve the deployment experience of the developers at GetYourGuide and to enable them to configure more sophisticated deployment strategies, the Developer Enablement team migrated their deployment tooling from Spinnaker to ArgoCD. Engineering Manager Harshal Shah discusses the lessons learned during the migration.

At GetYourGuide, our services are written in Java, PHP, Python, and Vue.js, and run in an Istio service mesh. We used Spinnaker to deploy these services to our production Kubernetes cluster. After running Spinnaker in production for more than two years, we saw the following challenges:

  • The Spinnaker UI was very complicated for our users. We created our own wrapper command-line tool to simplify rollbacks, but the UX was still not very good.
  • Our deployment pipelines, though abstracted for most services, were still complicated.
  • It was very difficult to customize canary configurations and have static baselines.

With these problems at hand, we looked for alternatives. Our initial search led us to two options: FluxCD and ArgoCD. After a hands-on evaluation, we chose ArgoCD, which has wider community usage and seemed more aligned with our needs.

As we progressed through planning, preparation, and finally migration of our services to ArgoCD, we learned lessons which could be beneficial for potential new users. Our learnings are summarized below:

Familiarize users by giving them a sandbox environment

Installing ArgoCD on our test environment and managing test deployments with it helped our users get familiar with the ArgoCD UI. As a result, most users did not find the UI overwhelming when their services were migrated in production.

Investing time in automating migrations paid off

We migrated our own services manually and used that opportunity to learn how to make the migration easy for our users. We were able to ensure that services that do not use progressive delivery (canaries or smoke tests) could be migrated with a one-line pull request. This allowed us to migrate around 300 services in two weeks.

Migrating our most complex services first helped build confidence

We chose to migrate two of our most complex services as early adopters. This gave us two benefits: 

  1. We could focus on them specifically and address any issues immediately without getting distracted.
  2. Once our early adopters were migrated and stabilized, we gained enough confidence to be able to migrate all other services.

Swift migration helped reduce cognitive load for our users

We chose to migrate all our remaining services to Argo in just two weeks. Beforehand, we ran multiple onboarding sessions and Q&As to answer user questions and walk through day-to-day workflows. All members of our team also paired with specific teams to migrate the services that needed progressive delivery. This swift migration helped reduce confusion about which service was running on which platform.

ArgoCD community is very active and welcoming

In the course of our migration we came across certain shortcomings in upstream ArgoCD and Rollouts. We opened upstream issues and 35+ pull requests which were very promptly answered, reviewed, and accepted by the community, which helped us move forward a lot faster. Some important contributions include:

  • We added a feature called rollback windows which would skip analysis if the target version is among the last N versions.
  • For applications with dozens of managed resources, the tree view would take a lot of time to render and hang. For this reason, changing the default view to a paginated list was a much better option. We added a change to allow applications to render a default view based on annotations.
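
The rollback window idea from the first contribution above can be illustrated with a minimal sketch. This is not the Argo Rollouts implementation; the function name, its parameters, and the representation of the history as a list of revision strings are all assumptions made for illustration.

```python
from typing import Sequence


def should_skip_analysis(target_revision: str,
                         deployed_history: Sequence[str],
                         window_size: int) -> bool:
    """Decide whether analysis can be skipped for a rollback (hypothetical helper).

    deployed_history lists previously deployed revisions, oldest first.
    Analysis is skipped only when the target is one of the last
    `window_size` entries, i.e. a version that was deployed recently.
    """
    if window_size <= 0:
        # No window configured: always run analysis.
        return False
    recent = deployed_history[-window_size:]
    return target_revision in recent


if __name__ == "__main__":
    history = ["v1", "v2", "v3", "v4"]  # illustrative revisions
    print(should_skip_analysis("v3", history, 2))  # within the last 2 revisions
    print(should_skip_analysis("v1", history, 2))  # outside the window
```

The reasoning behind such a window is that a recently deployed version has already passed analysis once, so a rollback to it can proceed faster.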

We also had some hiccups along the way...

Argo deleted some very important namespaces

While migrating our namespace management automation from Spinnaker to ArgoCD, we accidentally deleted some very important application namespaces, which we were able to recover within an hour. To prevent a repeat, we implemented a validating admission webhook that rejects namespace deletion unless the namespace carries the explicit label namespaces.gyg.io/protection=allow-delete.
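
The decision logic of such a webhook could look roughly like the sketch below. It only covers the part that inspects a Kubernetes AdmissionReview request and builds the response; serving it over HTTPS and registering it with the cluster are left out. The function and constant names are hypothetical, and this is not the actual GetYourGuide implementation, only the label key and value are taken from the text.

```python
from typing import Any, Dict

# Label key and value from the article; the constant names are hypothetical.
PROTECTION_LABEL = "namespaces.gyg.io/protection"
ALLOW_DELETE_VALUE = "allow-delete"


def review_namespace_deletion(admission_review: Dict[str, Any]) -> Dict[str, Any]:
    """Build an AdmissionReview response that blocks unlabelled namespace deletes."""
    request = admission_review["request"]
    allowed = True
    message = ""

    is_namespace = request.get("kind", {}).get("kind") == "Namespace"
    if request.get("operation") == "DELETE" and is_namespace:
        # For DELETE requests the existing object is carried in oldObject.
        old_object = request.get("oldObject") or {}
        labels = old_object.get("metadata", {}).get("labels") or {}
        if labels.get(PROTECTION_LABEL) != ALLOW_DELETE_VALUE:
            allowed = False
            message = (
                f"namespace deletion requires label "
                f"{PROTECTION_LABEL}={ALLOW_DELETE_VALUE}"
            )

    response: Dict[str, Any] = {"uid": request["uid"], "allowed": allowed}
    if not allowed:
        response["status"] = {"code": 403, "message": message}

    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

With this approach, deleting a namespace becomes a deliberate two-step action: someone first adds the label, then deletes the namespace.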

AnalysisRuns causing HTTP 429 on Datadog

With multiple services using Argo Rollouts for progressive delivery, many AnalysisRuns were running in our cluster at the same time, each of them querying Datadog. These API calls were rate limited by Datadog, and we had to request an increase in our rate limit to mitigate this.

As a long-term solution, we have made an upstream contribution to use the Datadog v2 API, which should reduce the chances of being rate limited.
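
A back-of-the-envelope estimate shows how quickly concurrent AnalysisRuns add up. The sketch below assumes each run issues one query per metric per measurement interval; the function name and all numbers are hypothetical and do not reflect our actual workload or Datadog's actual limits.

```python
def datadog_queries_per_hour(concurrent_runs: int,
                             metrics_per_run: int,
                             interval_seconds: float) -> float:
    """Estimate hourly query volume, assuming one query per metric per interval."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    measurements_per_hour = 3600 / interval_seconds
    return concurrent_runs * metrics_per_run * measurements_per_hour


if __name__ == "__main__":
    # Hypothetical load: 20 concurrent runs, 3 metrics each, measured every 30 seconds.
    print(datadog_queries_per_hour(20, 3, 30))  # 7200.0
```

Under these assumptions, lengthening the interval or reducing the number of queries per run lowers the volume proportionally.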

SLOs on ArgoCD and Rollouts

We could not find an existing solution for SLOs on ArgoCD and Rollouts, so we have created our own. We have started a conversation with the community and shall contribute it upstream once it is approved. 

Conclusion

These are early days, but so far we are happy with the migration to ArgoCD. 

  • The internal components seem pretty robust and do not break. 
  • Our Spinnaker upgrades were always coupled with some initial instability. However, with Argo we have upgraded a few times and have not seen any issues so far. 
  • Argo logs are much more readable and in case of problems, it is easy to detect which component failed and how to fix it. 
  • Argo’s RBAC helps us ensure the right teams have privileges on services they own.
  • Argo Rollouts helps us provide flexible and user-friendly solutions for progressive delivery, whether it is canaries, smoke tests, or both.

With these benefits, we look forward to enabling additional progressive deployment capabilities for our users.
