
Yandex releases world’s largest event dataset for advancing recommender systems

India, June 2, 2025: Yandex has published Yambda (Yandex Music Billion-Interactions Dataset), the world’s largest currently available open dataset for recommender systems, containing nearly 5 billion anonymized user interactions with audio tracks from its music streaming platform, Yandex Music.

Yambda serves as a universal benchmark for testing new approaches and algorithms across all domains utilizing recommender systems — e-commerce, social networks, and short-form video platforms.

The dataset enables researchers to develop and test new recommender algorithms against its baseline models, accelerating innovation. Startups with limited data can use Yambda to build and test systems before scaling, which speeds up the creation of advanced technologies tailored to business needs worldwide.

Bridging the research-industry gap

The quality and scale of training data are critical to delivering relevant recommendations on platforms like streaming services, social networks, short-form video apps, and e-commerce marketplaces. However, research in recommender systems has lagged behind rapidly advancing fields like large language models, largely due to limited access to large-scale datasets. Effective recommendation models require terabytes of behavioral data, which commercial platforms possess but rarely share publicly.

Researchers are often left with small, outdated datasets that fail to capture the complexity of modern usage: 

  • Spotify’s Million Playlist Dataset is too small for commercial-scale recommender systems.
  • The Netflix Prize dataset, with ~17,000 items and date-only timestamps, limits temporal modeling and large-scale research.
  • The Criteo 1TB Click Logs dataset lacks proper documentation and identifiers, and focuses narrowly on ad clicks.

“Recommender systems are inherently tied to sensitive data. Companies can only publish recommender system datasets publicly after exhaustive anonymization, a resource-intensive process that’s slowed open innovation,” explains Nikolai Savushkin, Head of Recommender Systems at Yandex.

This data scarcity creates a gap: models that excel in academic settings often underperform in real-world applications. Efforts to integrate recommender systems with advanced architectures are also constrained by the lack of suitable training data.

About the Yambda dataset

Yambda addresses these challenges by providing a massive, anonymized dataset from Yandex Music, a music streaming service with ~28 million monthly users. The dataset shows how users interact with the content offered by Yandex Music, which is known for My Wave, a recommendation system that tailors the listening experience to each user’s tastes. To protect privacy, all user and track data is anonymized and represented by numeric identifiers.

Key features of the dataset:

  • 4.79 billion anonymized user interactions collected over 10 months.
  • Data from 1 million users and anonymized descriptors for 9.39 million tracks.
  • Includes two feedback types: implicit interactions (listens) and explicit interactions (likes, dislikes, and their removal).
  • Offers audio embeddings (vector representations generated via convolutional neural networks) and anonymized information about tracks.
  • Features an “is_organic” flag marking whether users discovered tracks independently or through recommendations, enabling deeper behavioral analysis.
  • All events are timestamped, which supports the analysis of user behavior over time and allows models to be evaluated under conditions that closely resemble real-world use.
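As an illustration of how the “is_organic” flag might be used, the minimal sketch below computes, for each user, the share of listens that were discovered without a recommendation. Only the “is_organic” name comes from the dataset description; the other field names (uid, item_id, timestamp, event_type), the event labels, and the toy records are assumptions made for this example and may not match the actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    uid: int          # anonymized numeric user id (hypothetical field name)
    item_id: int      # anonymized numeric track id (hypothetical field name)
    timestamp: int    # event time (hypothetical unit: seconds)
    event_type: str   # "listen", "like", ... (hypothetical labels)
    is_organic: bool  # True if the user found the track without a recommendation


def organic_share_by_user(events):
    """Return {uid: fraction of that user's listens that were organic}."""
    organic = defaultdict(int)
    total = defaultdict(int)
    for e in events:
        if e.event_type != "listen":
            continue  # only implicit feedback is counted here
        total[e.uid] += 1
        organic[e.uid] += int(e.is_organic)
    return {uid: organic[uid] / total[uid] for uid in total}


# Toy records, not real Yambda data.
events = [
    Event(1, 10, 100, "listen", True),
    Event(1, 11, 160, "listen", False),
    Event(1, 11, 170, "like", False),
    Event(2, 10, 200, "listen", False),
]
shares = organic_share_by_user(events)
```

Here user 1 has one organic listen out of two, and user 2 has none.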

The dataset is released in Apache Parquet format, compatible with distributed processing systems such as Spark or Hadoop and analytical libraries like Pandas and Polars.

“Yambda empowers researchers to test innovative hypotheses and businesses to build smarter recommender systems. Ultimately, users benefit — finding the perfect song, product, or service effortlessly,” notes Nikolai Savushkin.

Dataset versions and evaluation

Available in three sizes — approximately 5 billion, 500 million, and 50 million events — the Yambda dataset accommodates researchers and developers with different needs and computational resource capacities. 

The dataset uses Global Temporal Split (GTS) for evaluation, a method that splits data by timestamp to preserve event sequences. Unlike Leave-One-Out, which removes the last positive interaction from each user’s history for testing, GTS avoids breaking temporal dependencies between training and test sets. This makes model testing more realistic, mimicking real-world conditions where future data is unavailable.
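The difference between the two splitting strategies can be seen on a toy example. The sketch below is an illustration under stated assumptions, not the benchmark’s own code: events are plain (user, item, timestamp) tuples, every event is treated as a positive interaction, and the split_time value is arbitrary.

```python
def global_temporal_split(events, split_time):
    """Train on events strictly before split_time, test on everything after."""
    train = [e for e in events if e[2] < split_time]
    test = [e for e in events if e[2] >= split_time]
    return train, test


def leave_one_out_split(events):
    """Hold out each user's latest event, regardless of when it happened."""
    last = {}
    for e in events:
        uid, _, ts = e
        if uid not in last or ts > last[uid][2]:
            last[uid] = e
    held_out = set(last.values())
    train = [e for e in events if e not in held_out]
    test = [e for e in events if e in held_out]
    return train, test


# Toy (user, item, timestamp) tuples, not real Yambda data.
events = [
    (1, "a", 10), (1, "b", 20), (1, "c", 30),
    (2, "a", 50), (2, "d", 60), (2, "e", 90),
]
gts_train, gts_test = global_temporal_split(events, split_time=55)
loo_train, loo_test = leave_one_out_split(events)

# With GTS, every training event precedes every test event.
gts_is_causal = max(e[2] for e in gts_train) < min(e[2] for e in gts_test)
# With Leave-One-Out, user 2's training events (t=50, t=60) are later than
# user 1's held-out event (t=30), so the model sees "future" data.
loo_leaks_future = max(e[2] for e in loo_train) > min(e[2] for e in loo_test)
```

In this toy data, Leave-One-Out trains on events from timestamps 50 and 60 while testing on an event from timestamp 30, which is the kind of leakage a global timestamp cut avoids.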

Baseline implementations include MostPop, DecayPop, ItemKNN, iALS, BPR, SANSA, and SASRec, providing benchmarks for comparing new recommender system approaches. These baselines are evaluated using standard metrics, including:

  • NDCG@k (ranking quality)
  • Recall@k (retrieval effectiveness)
  • Coverage@k (catalog diversity)
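For readers less familiar with these metrics, the following minimal sketch shows one common way to compute them with binary relevance. The exact definitions used in the Yambda benchmark may differ (for example, some implementations normalize Recall@k by min(k, number of relevant items)); the function names and toy data are assumptions for illustration only.

```python
import math


def recall_at_k(recommended, relevant, k):
    """Fraction of the relevant items that appear in the top-k list."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: DCG of the top-k list divided by the ideal DCG."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, item in enumerate(recommended[:k]) if item in relevant)
    idcg = sum(1.0 / math.log2(rank + 2)
               for rank in range(min(k, len(relevant))))
    return dcg / idcg


def coverage_at_k(all_recommendations, catalog_size, k):
    """Share of the catalog that appears in at least one user's top-k list."""
    seen = set()
    for recs in all_recommendations:
        seen.update(recs[:k])
    return len(seen) / catalog_size


# Toy recommendations and ground truth, not real benchmark output.
recs = {"u1": ["a", "b", "c"], "u2": ["a", "d", "b"]}
truth = {"u1": ["a", "c"], "u2": ["e"]}
k = 2
recall_u1 = recall_at_k(recs["u1"], truth["u1"], k)
ndcg_u1 = ndcg_at_k(recs["u1"], truth["u1"], k)
coverage = coverage_at_k(recs.values(), catalog_size=5, k=k)
```

In practice the per-user values of NDCG@k and Recall@k are averaged over all test users, while Coverage@k is computed once over the full set of recommendation lists.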

“When industry leaders share hard-won tools and data, a rising tide lifts all boats: researchers gain real-world benchmarks, startups access resources once reserved for tech giants, and users everywhere enjoy greater personalization,” added Nikolai Savushkin.

Yambda, the world’s largest open recommender system dataset, is now available on Hugging Face.

About Yandex

Yandex is a global technology company that builds intelligent products and services powered by machine learning. The company’s goal is to help consumers and businesses better navigate the online and offline world. Since 1997, Yandex has been delivering world-class, locally relevant search and information services and has also developed market-leading on-demand transportation services, navigation products, and other mobile applications for millions of consumers across the globe.

About My Wave

My Wave, a personalized recommendation system integrated into the multi-million-user music streaming service, Yandex Music, employs deep neural models and AI algorithms to analyze over a thousand factors — including user interactions, customizable mood/language settings, and real-time music analysis of spectrograms, frequency ranges, rhythm, vocal tone, and genre. By processing listening history and track sequences, it dynamically adapts to user preferences, identifies audio similarities, and predicts musical tastes to deliver tailored suggestions.
