Yasamin Klingler, Software Engineer, walks us through the challenges of managing visual content and the solutions we use at GetYourGuide to improve the customer's visual experience across our product and marketing channels.
At GetYourGuide, we manage millions of images gathered from multiple sources, such as our partners, tour providers, internal teams, and travelers. All these sources work towards the same goal: to bring the true essence of the activities to travelers' eyes through accurate, aesthetically beautiful, and relevant images. At the same time, these images are used across different channels, each with its own requirements.
The journey begins when one of the aforementioned sources uploads an image to our system. The image must be moderated to ensure it is Safe For Work across various categories. Context is crucial; for instance, an image of a weapon might be suitable for a Museum of the Second World War tour while being contextually incorrect for a Romantic Dinner Cruise tour.
Additionally, we rigorously scan the uploaded files for potential viruses and malware to ensure their safety before storing them on our side.
To optimize our storage and avoid unnecessary duplication, we aim to maintain a single version of each image. For instance, if a supplier uploads a picture of the Eiffel Tower to both the Eiffel Tower Summit Floor tour and the Eiffel Tower Summit Floor Ticket & Seine River Cruise, we do not want to store the same image once per tour. Instead, we keep one version of the image, identified by its hash, while keeping track of the references in which the image is used.
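The dedup-by-hash idea can be sketched in a few lines. This is a minimal in-memory illustration, not our production system: the `images` and `references` dictionaries stand in for object storage and a reference database, and the function and tour IDs are hypothetical.

```python
import hashlib

# Hypothetical in-memory stores: content hash -> bytes, and hash -> tour IDs.
# In production these would be object storage and a database, respectively.
images: dict[str, bytes] = {}
references: dict[str, set[str]] = {}

def store_image(content: bytes, tour_id: str) -> str:
    """Deduplicate by content hash: keep the bytes once, but record every
    tour that references this image."""
    digest = hashlib.sha256(content).hexdigest()
    if digest not in images:
        images[digest] = content  # first upload: store the bytes
    references.setdefault(digest, set()).add(tour_id)  # always track the reference
    return digest

# The same photo uploaded to two tours is stored only once.
photo = b"...raw bytes of an Eiffel Tower photo..."
h1 = store_image(photo, "eiffel-tower-summit-floor")
h2 = store_image(photo, "summit-floor-and-seine-cruise")
print(h1 == h2, len(images), len(references[h1]))  # True 1 2
```

Deleting an image then becomes a reference-counting question: the bytes are only removed once no tour references the hash anymore.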
Finally, once all these steps have been completed successfully, the image lands in our object storage.
Recognition and processing
Now that we have the images uploaded, we employ a series of computer vision techniques to annotate, refine, filter, optimize, and prepare them to be served.
During the annotation stage, we initially leverage external APIs, such as the Google Vision API, to detect faces, landmarks, and many other available tags. Moreover, we train classification models internally for company-specific concepts. For instance, our branding teams need to discern whether an image aligns with our brand identity.
We generate multiple crop coordinates for every annotated image, focusing on the most significant elements. While we are able to serve the original version of the image in various sizes, the smart crops bring more value for smaller dimensions.
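One way to derive such crop coordinates is to fit the largest window of a target aspect ratio around the most salient annotated region. The following is a simplified sketch under assumed inputs: `box` stands for a bounding box (e.g. a detected face or landmark) coming out of the annotation stage, and the function name is hypothetical.

```python
def smart_crop(img_w: int, img_h: int, box: tuple, target_ar: float) -> tuple:
    """Return (x, y, w, h) of the largest crop with aspect ratio target_ar
    (width / height), centred on the salient box and clamped to the image.
    `box` is (x, y, w, h) of the most significant element."""
    # Largest crop of the requested aspect ratio that fits inside the image.
    if img_w / img_h > target_ar:
        crop_h, crop_w = img_h, int(img_h * target_ar)
    else:
        crop_w, crop_h = img_w, int(img_w / target_ar)
    # Centre the crop on the salient box, then clamp to the image bounds.
    cx = box[0] + box[2] / 2
    cy = box[1] + box[3] / 2
    x = min(max(int(cx - crop_w / 2), 0), img_w - crop_w)
    y = min(max(int(cy - crop_h / 2), 0), img_h - crop_h)
    return x, y, crop_w, crop_h

# Square crop of a 1200x800 image, centred on a face near the right edge:
print(smart_crop(1200, 800, (1000, 300, 120, 120), 1.0))  # -> (400, 0, 800, 800)
```

For a thumbnail, such a crop keeps the subject in frame where a naive centre crop of the original might cut it off.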
While hashing images prevents us from displaying identical ones, it doesn't necessarily eliminate near-identical or highly similar images. Take, for instance, a tour of The Last Supper. While showcasing various images of The Last Supper can enrich the tour description pages, using three nearly identical images for our marketing campaigns would be redundant and less effective. Thus, addressing image diversity and detecting similarities remain crucial tasks.
For this reason, we detect similar images and filter the very similar ones for marketing campaigns.
Example: Image Similarity Detection
Every image can be numerically represented by its pixel values. One intuitive approach to detecting image similarities is to create a matrix for each image and then compare these matrices using basic distance metrics like Euclidean or Hamming distance. While this method might seem straightforward and effective for a small dataset comprising a few hundred images, it falls short when faced with even minor image alterations, such as varied cropping or rotations in a dataset of millions of images.
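The limitation above is easy to demonstrate. In this toy NumPy sketch, two identical pixel matrices have distance zero, but a simple rotation of the same content produces a large distance even though a human would call the images near-identical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "images" as grayscale pixel matrices.
img_a = rng.integers(0, 256, size=(224, 224)).astype(np.float64)
img_b = img_a.copy()
img_rot = np.rot90(img_a)  # same content, rotated 90 degrees

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

# Identical pixels -> distance 0, so exact duplicates are trivially caught...
print(euclidean(img_a, img_b))          # 0.0
# ...but a mere rotation yields a huge distance, which is why raw pixel
# metrics break down for near-duplicates.
print(euclidean(img_a, img_rot) > 10_000)  # True
```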
A more sophisticated approach involves identifying an image's key features or points. For instance, when comparing two RGB images of size 224 x 224, a direct pixel comparison would involve assessing approximately 150K points from each image without guaranteeing an exact match. In contrast, by utilizing image descriptors, we focus on comparing the most relevant image features, such as edges.
There are several techniques to generate these descriptors, including Speeded Up Robust Features (SURF) and Scale-Invariant Feature Transform (SIFT). Fig 3 illustrates the SIFT-generated descriptors. However, for our use case, we leverage the intermediate layers of EfficientNet to extract feature maps, as it demonstrated significantly better performance for image similarity learning on our image dataset.
To identify similar images within our collection, we must compare them against one another. This begins with preprocessing steps, such as resizing images to a consistent dimension suitable for our model, converting them to grayscale, and generating TensorFlow features. Subsequently, we utilize EfficientNet to produce embeddings of these preprocessed images.
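The embedding step can be sketched with Keras. This is an illustrative stand-in, not our production pipeline: the choice of EfficientNetB0 is an assumption (any EfficientNet variant works the same way), and `weights=None` keeps the sketch self-contained, whereas in practice you would load pretrained or fine-tuned weights.

```python
import numpy as np
import tensorflow as tf

# EfficientNetB0 without its classification head; global average pooling
# turns the final feature map into a fixed-size embedding vector.
# weights=None avoids downloading weights in this sketch.
model = tf.keras.applications.EfficientNetB0(
    include_top=False, weights=None, pooling="avg", input_shape=(224, 224, 3)
)

def embed(images: np.ndarray) -> np.ndarray:
    """images: batch of preprocessed arrays of shape (n, 224, 224, 3)."""
    x = tf.keras.applications.efficientnet.preprocess_input(images)
    return model.predict(x, verbose=0)

batch = (np.random.rand(2, 224, 224, 3) * 255).astype("float32")
emb = embed(batch)
print(emb.shape)  # (2, 1280) — EfficientNetB0's final feature dimension
```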
Once we have these embeddings, we can pinpoint the nearest neighbors for a given image. To achieve this, we employ Facebook AI Similarity Search (FAISS), a rapid method designed for searching embeddings in multimedia documents.
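Conceptually, the search is just "closest embeddings by L2 distance." Here is a NumPy sketch of the exact search that FAISS's `IndexFlatL2` performs, with hypothetical random vectors standing in for image embeddings; FAISS provides the same result at scale, with approximate indexes for millions of vectors.

```python
import numpy as np

def nearest_neighbors(query: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Exact L2 nearest-neighbour search, written out for illustration."""
    dists = np.linalg.norm(corpus - query, axis=1)  # distance to each embedding
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]

rng = np.random.default_rng(42)
corpus = rng.normal(size=(1000, 128))                  # stand-in image embeddings
query = corpus[7] + rng.normal(scale=0.01, size=128)   # a near-duplicate of item 7
idx, dists = nearest_neighbors(query, corpus)
print(idx[0])  # 7 — the near-duplicate is ranked first
```

With FAISS installed, the equivalent is roughly `index = faiss.IndexFlatL2(128)`, `index.add(corpus.astype("float32"))`, `index.search(query[None].astype("float32"), k)`. Images whose nearest-neighbour distance falls below a chosen threshold can then be treated as near-duplicates and filtered.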
By integrating these techniques, we can not only identify similar images but also curate a diverse selection of images.
Throughout our journey at GetYourGuide, we consistently encounter unique challenges related to our visual content. This article endeavors to encapsulate a fraction of our initiatives, yet the potential in this domain is vast. As computer vision technology rapidly advances, leveraging cutting-edge technologies while maintaining their tangible business impact is crucial. Ultimately, our primary objective remains enhancing the visual experience of our travelers while enabling marketing channels to reach a broader audience and enable unforgettable travel experiences for as many travelers as possible.
I would like to thank Damien, Piotr, and Tündi for their collaboration at multiple stages of our image journey, and Joshua, Prateek, and Susanna for their contributions to this post.