GetYourGuide recently sponsored Germany’s preeminent conference for Pythonistas – those working with the Python programming language. Here, Theodore Meynard, Data Science Manager for Activity Ranking, who also chaired the event’s steering committee, shares his highlights.
PyCon DE & PyData Berlin is a global platform for Python enthusiasts, developers, and industry professionals to come together. April 2023 marked the third time the two landmark conferences – PyCon DE and PyData Berlin – had joined forces, with this year’s event seeing some 1,400 attendees take over the Berlin Congress Center and more than 300 join online. The joint conference invited professionals and enthusiasts alike to explore the latest trends, advancements, and best practices in the Python ecosystem.
Like other PyCon and PyData conferences, the event was organized by volunteers and community sponsors. GetYourGuide is committed to supporting open-source software, and for that reason we got involved on several levels. First and foremost, the company was a conference sponsor, with our support going towards this and other open-source projects, thanks to the work of the NumFOCUS foundation.
Colleagues also lent practical support by volunteering in the running of the event. That includes Mihail Douhaniaris, Data Scientist for Activity Ranking, and Jean Machado, Data Science Manager for Growth Data Products, who took leading roles in the video committee. In addition, Theodore Meynard, Data Science Manager for Activity Ranking, served as the steering committee chair and was responsible for the overall organization of the event.
Finally, because peer learning is an important cornerstone of our commitment to colleagues’ personal and professional development, we arranged for a group of six engineers and data scientists to attend the conference. As well as being a great opportunity to meet peers from a range of industries based in Berlin and beyond, it was a valuable learning experience.
GetYourGuide’s participation in PyCon DE & PyData Berlin was all about knowledge sharing. With that in mind, Theodore also spoke at the conference, delivering two talks on how GetYourGuide uses Python.
His first presentation was about MLOps in practice at GetYourGuide. In this talk, he explained how his team de-risked the migration from batch to real-time inference by breaking it down into multiple steps, each adding value and providing learning opportunities. He also shared the challenges he and his team faced with the new architecture and the key design decisions they made to overcome them. These included explicit ownership, extensive end-to-end testing with real data, and automating workflows for testing, training, and deployment as part of a continuous integration process.
Theodore’s second talk was titled Software Design Pattern for Data Science, and focused on the importance of refactoring, designing for cognitive capabilities, and testability in software design. Theodore discussed machine learning design patterns such as workflow pipelines, the transform pattern, and feature stores. He also highlighted the need for careful consideration when implementing these patterns, as they can add complexity and should only be used when they solve a specific problem. These kinds of guidelines and tradeoffs are always top of mind when data scientists and MLOps engineers develop data products at GetYourGuide.
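To make one of those patterns concrete, here is a minimal, self-contained sketch of the transform pattern – toy code, not GetYourGuide’s implementation: fitting a transformer once and reusing the same object at serving time keeps training and serving features consistent, avoiding training/serving skew.

```python
# Toy illustration of the "transform" design pattern: one object owns the
# feature transformation, so training and serving share identical logic.

class MinMaxScaler:
    """Minimal transformer: fit() learns parameters, transform() reuses them."""

    def fit(self, values):
        self.lo, self.hi = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.hi - self.lo) or 1.0  # avoid division by zero
        return [(v - self.lo) / span for v in values]

# Fitted once, at training time...
scaler = MinMaxScaler().fit([10, 20, 30])
train_features = scaler.transform([10, 20, 30])

# ...and the very same object is reused at serving time.
serve_features = scaler.transform([15])
```

Because the serving path calls the same `transform` as the training path, a change to the feature logic can never silently diverge between the two.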
The conference covered a wide range of topics and was a fantastic learning experience for the team. In particular, we were excited about these four core subject areas, all of which hold exciting potential for GetYourGuide’s data scientists, data analysts, and MLOps engineers…
Hyperparameter optimization can be resource-intensive, as it requires fully training the model for each configuration. However, this cost can be substantially lowered – at the price of a slight decrease in final performance – by sampling the data and using algorithms that pick the next configuration to try. Amazon Web Services (AWS) has open-sourced a library replete with such algorithms that can be used to evaluate many hyperparameter configurations efficiently.
This was of particular interest, as GetYourGuide has a few projects that could use such a library to optimize our models.
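To illustrate the idea, here is a minimal successive-halving sketch in plain Python: many configurations are scored cheaply on small data samples, and only the most promising survivors are promoted to larger samples. The configuration format, the `evaluate` stand-in, and the halving schedule are all hypothetical – the AWS library mentioned above provides production-grade versions of such algorithms.

```python
import random

random.seed(0)

def evaluate(config, sample_size):
    """Stand-in for training on `sample_size` rows and returning a validation
    score; the noise shrinks as the sample grows, as it would in practice."""
    noise = random.gauss(0, 1.0 / sample_size ** 0.5)
    return config["quality"] + noise  # "quality" plays the role of the true score

def successive_halving(configs, min_samples=100, max_samples=10_000):
    sample_size = min_samples
    while len(configs) > 1 and sample_size <= max_samples:
        scored = sorted(configs, key=lambda c: evaluate(c, sample_size), reverse=True)
        configs = scored[: max(1, len(scored) // 2)]  # keep only the top half
        sample_size *= 4  # give the survivors four times more data
    return configs[0]

# Eight hypothetical configurations; only the best few ever see the full data.
candidates = [{"lr": 10 ** -i, "quality": random.random()} for i in range(8)]
best = successive_halving(candidates)
```

The saving comes from the schedule: most configurations are eliminated after seeing only a small sample, so the full training cost is paid for just a handful of finalists.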
Code coverage is a commonly used metric for assessing the efficacy of our tests, but it fails to indicate how well our tests can detect unanticipated changes in the codebase. This is where mutation testing becomes useful. Mutation testing evaluates the robustness of your test suite by creating modified versions of your code that are meant to make your tests fail. When one of these "mutant" versions doesn’t trigger a test failure, it signals that a similar change could introduce a bug into production undetected, and that your tests need to be strengthened to account for it. Mutmut, an open-source Python mutation testing tool, can help with this process.
This is an area that the Data Product teams at GetYourGuide have not yet explored, but it offers advantages that make it worth investigating.
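The core idea can be sketched by hand in a few lines – this is a toy illustration of what a tool like mutmut automates across a whole codebase (generating mutants, running the suite against each, and reporting survivors):

```python
# Original function under test.
def is_adult(age):
    return age >= 18

# A mutmut-style mutant: the `>=` operator flipped to `>`.
def is_adult_mutant(age):
    return age > 18

def weak_tests(fn):
    # Executes every line (100% coverage) but never checks the boundary.
    return fn(30) is True and fn(5) is False

def strong_tests(fn):
    # Adds the boundary case, so the operator mutation is "killed".
    return weak_tests(fn) and fn(18) is True

surviving_mutant = weak_tests(is_adult_mutant)       # mutant passes the weak suite
killed_mutant = not strong_tests(is_adult_mutant)    # strong suite catches it
```

The weak suite has full line coverage yet lets the mutant survive – exactly the blind spot that coverage alone cannot reveal and mutation testing does.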
An engineering solutions company had a challenge with sklearn and LightGBM models that carried a lot of information unnecessary for prediction, resulting in large model sizes – some bigger than 1GB – that were difficult to deploy. To solve this problem, they dug deep into the data and identified what could be removed from the pickled object while keeping predictions intact. Their efforts culminated in a size reduction to approximately 100MB, and the creation of an open-source library, slim-trees, to facilitate this process.
As the size of the models we use in production is constantly increasing, this is something we might need to look into in the near future.
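The general idea – stripping training-only state out of the serialized object – can be sketched with standard pickle hooks. This toy example is not the slim-trees API; the class and attribute names are invented for illustration:

```python
import pickle

class ToyModel:
    """Hypothetical model carrying heavy bookkeeping that prediction never uses."""

    def __init__(self):
        self.weights = [0.5, -1.2, 3.1]              # needed for prediction
        self.training_cache = list(range(100_000))   # training-only baggage

    def __getstate__(self):
        # Serialize everything except the attributes prediction doesn't need.
        state = self.__dict__.copy()
        state.pop("training_cache", None)
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.training_cache = None  # absent after loading; predictions unaffected

    def predict(self, x):
        return sum(w * xi for w, xi in zip(self.weights, x))

model = ToyModel()
full_size = len(pickle.dumps(model.__dict__))  # with the training cache
slim_size = len(pickle.dumps(model))           # __getstate__ drops it
restored = pickle.loads(pickle.dumps(model))   # still predicts identically
```

The slimmed pickle is orders of magnitude smaller here, while the restored model produces exactly the same predictions – the same property slim-trees preserves for real sklearn and LightGBM models.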
Transformers, the foundation of modern Large Language Models (LLMs), typically impose a hard limit on the number of tokens you can use in your text. Practically, this means only about half a page of text can fit into the model, which is not ideal for some scenarios. Notably, most international benchmarks use text inputs shorter than 512 tokens, which does not encourage models to find alternative approaches. While methods exist to overcome the 512-token limit, most recent models don’t implement them due to inertia. At GetYourGuide, there are some internal NLP use cases that would benefit from this additional context to improve the predictions of our models.
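A common practical workaround for the 512-token limit can be sketched without any model at all: split the tokenized document into overlapping windows, run the model on each window, and aggregate the per-window outputs. The function below is an illustrative sketch (the window and stride sizes are typical but arbitrary), with the tokenizer and model deliberately stubbed out.

```python
def sliding_windows(token_ids, max_len=512, stride=256):
    """Yield overlapping chunks of at most `max_len` tokens, shifted by
    `stride`, so every token of a long document is seen by the model."""
    if len(token_ids) <= max_len:
        yield token_ids
        return
    for start in range(0, len(token_ids) - stride, stride):
        yield token_ids[start:start + max_len]

# Stand-in for a tokenized long document of 1,200 tokens.
tokens = list(range(1200))
chunks = list(sliding_windows(tokens))
```

Each chunk fits the model's limit, the overlap preserves context across chunk boundaries, and the per-chunk predictions can then be averaged or otherwise pooled into a single document-level output.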
While the conference had many interesting presentations, a few particularly resonated with us. Here are our top talks from the event:
The talk explores "Learned DBMS Components 2.0," using data-driven learning to understand data distribution in DBMS, thus reducing training overhead. It also introduces zero-shot learning for tasks like physical cost estimation, not achievable with data-driven learning.
We liked it because it gives a short overview of a completely new application of machine learning, one that already has some very interesting results.
The speaker stresses the significance of clean, maintainable code in data science. He discusses common coding pitfalls, suggests improvements like meaningful names and shorter functions, and introduces a method for assessing code complexity via flake8 plugins, aiding teams in prioritizing code enhancements.
It’s an important, relevant topic - and a refresher is always good!
The talk playfully elucidates Modern Data Stack terms like "data mesh" and "reverse ETL". It starts with a review of the PyData stack for ETL workflows, discusses the challenges of self-serve analytics, and highlights the shift from ETL to ELT.
This is an entertaining and informative talk, providing a good overview of available tools.
An inspiring talk exploring Python’s diverse applications across numerous industries, including the speaker’s personal experience of using Python in game development, telecommunications, fintech, and cybersecurity for everything from game creation to machine learning application.
The speaker explores Topological Data Analysis (TDA), teaching neural networks geometry and topology so they can reason better about the shape of data. With applications beyond computer vision, such as learning richer embeddings, TDA is becoming popular outside academia, aided by Python libraries that extend scikit-learn or PyTorch. It’s a really nice introduction to how TDA and Geometric Deep Learning can help us handle complex data structures.
The three-day PyCon DE & PyData Berlin was invaluable in bringing to light the latest trends and advancements in Python. To deep dive into more of the topics discussed, you can check out all the talks here.
Not only was the event a fantastic learning experience for the GetYourGuide team, it was also our pleasure to play such a key role in another successful conference – particularly with regard to fostering knowledge exchange and collaboration within the community. We look forward to taking part in future events where the data community can come together and share its learnings.