Researchers Can Now Access Yahoo’s Largest Machine Learning Dataset

Yahoo announced Thursday the public release of what it called “the largest-ever machine learning dataset to the research community.” The dataset contains about 110 billion events (13.5 TB uncompressed) of anonymized user-news item interactions, recorded from about 20 million users between February 2015 and May 2015.
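To put those figures in perspective, here is a rough back-of-envelope calculation. It is only a sketch: it assumes "TB" means decimal terabytes (10^12 bytes) and treats the rounded totals quoted above as exact, so the results are averages rather than properties of any individual record or user. The variable names are illustrative and do not come from the dataset.

```python
# Rounded totals quoted in the announcement.
total_events = 110 * 10**9      # ~110 billion events
total_bytes = 13.5 * 10**12     # 13.5 TB uncompressed (assumed decimal TB)
total_users = 20 * 10**6        # ~20 million users

# Average uncompressed size of one interaction event.
bytes_per_event = total_bytes / total_events

# Average number of recorded events per user over the collection period.
events_per_user = total_events / total_users

print(f"about {bytes_per_event:.0f} bytes per event")
print(f"about {events_per_user:.0f} events per user")
```

Under these assumptions, the average works out to roughly 123 bytes per event and about 5,500 events per user over the collection period.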

The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate.

In addition to the interaction data, Yahoo is providing categorized demographic information (age range, gender, and generalized geographic data) for a subset of the anonymized users. On the item side, Yahoo is releasing the title, summary, and key phrases of each news article involved.

The interaction data is timestamped with the relevant local time and also contains partial information about the device on which the user accessed the news feeds, which allows for interesting work in contextual recommendation and temporal data mining.