DataSets¶
Introduction¶
Beta-Recsys provides users with a wide range of datasets for training recommender systems. For convenience, we have preprocessed a number of datasets for you, so you do not need to split them on your local machine. The framework also provides a set of useful interfaces for data splitting.
Usage¶
The following code automatically downloads the Movielens_100k dataset and splits it using the leave-one-out splitting strategy.
from beta_rec.datasets.movielens import Movielens_100k
dataset = Movielens_100k()
split_dataset = dataset.load_leave_one_out()
To clean the dataset by filtering before splitting, you can use the filtering parameters. For example, to filter out users that have fewer than 30 items and items that have fewer than 15 records, you can run:
dataset = Movielens_100k(min_u_c=15, min_i_c=30)
The filtering parameters are:
min_u_c: filter out items that were purchased by fewer than min_u_c users. (default: 0)
min_i_c: filter out users who have purchased fewer than min_i_c items. (default: 3)
min_o_c: filter out users who have fewer than min_o_c orders. (default: 0)
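The effect of min_u_c and min_i_c can be illustrated with a minimal, standalone sketch. The helper `filter_interactions` below is hypothetical and is not part of beta_rec; it assumes interactions are plain (user, item) pairs and applies the item filter first, then the user filter, in a single pass. The library's own order and implementation may differ.

```python
from collections import Counter


def filter_interactions(records, min_u_c=0, min_i_c=3):
    """Filter (user, item) pairs by item popularity and user activity.

    Hypothetical helper for illustration only, not the beta_rec implementation.
    """
    # Count distinct users per item and drop items with too few users.
    users_per_item = Counter(item for _, item in set(records))
    kept = [(u, i) for u, i in records if users_per_item[i] >= min_u_c]
    # Count distinct remaining items per user and drop users with too few items.
    items_per_user = Counter(u for u, _ in set(kept))
    return [(u, i) for u, i in kept if items_per_user[u] >= min_i_c]


interactions = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "a"),
]
filtered = filter_interactions(interactions, min_u_c=2, min_i_c=2)
```

Because this is a single pass, removing users can push some items back below min_u_c; a stricter filter would repeat both steps until nothing changes.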
By default, 100 negative items are sampled for the testing set to reduce the evaluation cost. To reduce the bias introduced by any particular set of negative items, each splitting strategy generates 10 different validation and testing sets. You can also specify these parameters:
split_dataset = dataset.load_leave_one_out(n_test=15, n_negative=200)
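To make the idea of negative sampling concrete, here is a minimal sketch using only the standard library. The function `sample_negatives` and its `seed` argument are assumptions made for illustration, not the beta_rec API; it draws, for each user, up to n_negative items that the user has not interacted with.

```python
import random


def sample_negatives(user_items, all_items, n_negative=100, seed=0):
    """Sample unseen items per user (hypothetical helper, not beta_rec code)."""
    rng = random.Random(seed)
    item_pool = set(all_items)
    negatives = {}
    for user, seen in user_items.items():
        # Candidates are all items the user has never interacted with.
        candidates = sorted(item_pool - set(seen))
        k = min(n_negative, len(candidates))
        negatives[user] = rng.sample(candidates, k)
    return negatives


user_items = {"u1": {1, 2}, "u2": {3}}
# Several differently seeded sets, in the spirit of n_test.
negative_sets = [
    sample_negatives(user_items, range(10), n_negative=5, seed=s)
    for s in range(3)
]
```

Drawing several sets with different seeds is one way to reduce the influence of any single unlucky draw on the evaluation.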
If you want to use the data splits generated by our Beta team, you can specify the download parameter.
from beta_rec.datasets.movielens import Movielens_100k
dataset = Movielens_100k()
split_dataset = dataset.load_leave_one_out(download=True)
For some very large datasets, generating negative items can be time-consuming. Using the pregenerated splits avoids this repeated work and also provides a common benchmark.
Note (25 Oct. 2020): the preprocessed splits of each dataset are currently somewhat out of date; we will regenerate them as soon as possible.
Dataset Statistics¶
Here we present some basic statistics for the datasets in our framework.
Dataset | Interactions | Baskets | Temporal |
---|---|---|---|
MovieLens-100K | ✔️ | ✖️ | ✔️ |
MovieLens-1M | ✔️ | ✖️ | ✔️ |
MovieLens-25M | ✔️ | ✖️ | ✔️ |
Last.FM | ✔️ | ✖️ | ✖️ |
Epinions | ✔️ | ✖️ | ✖️ |
Tafeng | ✔️ | ✖️ | ✔️ |
Dunnhumby | ✔️ | ✔️ | ✔️ |
Instacart | ✔️ | ✖️ | ✔️ |
citeulike-a | ✔️ | ✖️ | ✖️ |
citeulike-t | ✔️ | ✖️ | ✖️ |
HetRec MovieLens | ✔️ | ✖️ | ✔️ |
HetRec Delicious | ✔️ | ✔️ | ✖️ |
HetRec LastFM | ✔️ | ✔️ | ✔️ |
Yelp | ✔️ | ✖️ | ✔️ |
Gowalla | ✔️ | ✖️ | ✔️ |
Yoochoose | ✔️ | ✖️ | ✔️ |
Diginetica | ✔️ | ✖️ | ✔️ |
Taobao | ✔️ | ✖️ | ✔️ |
Ali-mobile | ✔️ | ✖️ | ✔️ |
Retailrocket | ✔️ | ✖️ | ✔️ |
Amazon Reviews | ✔️ |
Some split methods require specific columns; for example, random_basket expects the dataset to have a Basket column. Here we list the split methods available for each dataset.
The prerequisites for each split method are:
leave_one_out: none
leave_one_basket: requires a Basket column in the dataset
random: none
random_basket: requires a Basket column in the dataset
temporal: requires a Timestamp (Temporal) column in the dataset
temporal_basket: requires a Timestamp (Temporal) column and a Basket column in the dataset
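The prerequisites above can be expressed as a small lookup. This is an illustrative sketch only: the names `SPLIT_REQUIREMENTS` and `applicable_splits`, and the lowercase column labels "basket" and "timestamp", are assumptions and not part of the framework.

```python
# Columns each split method needs, following the list of prerequisites.
SPLIT_REQUIREMENTS = {
    "leave_one_out": set(),
    "leave_one_basket": {"basket"},
    "random": set(),
    "random_basket": {"basket"},
    "temporal": {"timestamp"},
    "temporal_basket": {"timestamp", "basket"},
}


def applicable_splits(columns):
    """Return the split methods whose required columns are all present."""
    available = {c.lower() for c in columns}
    return [
        name
        for name, needed in SPLIT_REQUIREMENTS.items()
        if needed <= available
    ]


# A dataset with timestamps but no baskets.
splits = applicable_splits(["user", "item", "rating", "timestamp"])
```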
Dataset | leave_one_out | leave_one_basket | random | random_basket | temporal | temporal_basket |
---|---|---|---|---|---|---|
MovieLens-100K | ✔️ | ✖️ | ✔️ | ✖️ | ✔️ | ✖️ |
MovieLens-1M | ✔️ | ✖️ | ✔️ | ✖️ | ✔️ | ✖️ |
MovieLens-25M | ✔️ | ✖️ | ✔️ | ✖️ | ✖️ | |
Last.FM | ✔️ | ✖️ | ✔️ | ✖️ | ✖️ | ✖️ |
Epinions | ✔️ | ✖️ | ✔️ | ✖️ | ✖️ | ✖️ |
Tafeng | ✔️ | ✖️ | ✔️ | ✖️ | ✔️ | ✖️ |
Dunnhumby | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Instacart | ✔️ | ✖️ | ✔️ | ✖️ | ✔️ | ✖️ |
citeulike-a | ✔️ | ✖️ | ✔️ | ✖️ | ✖️ | ✖️ |
citeulike-t | ✔️ | ✖️ | ✔️ | ✖️ | ✖️ | ✖️ |
HetRec MovieLens | ✔️ | ✖️ | ✔️ | ✖️ | ✔️ | ✖️ |
HetRec Delicious | ✔️ | ✔️ | ✔️ | ✖️ | ✖️ | ✖️ |
HetRec LastFM | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Yelp | ✔️ | ✖️ | ✔️ | ✖️ | ✖️ | |
Gowalla | ✔️ | ✖️ | ✔️ | ✖️ | ✖️ | |
Yoochoose | ✔️ | ✖️ | ✔️ | ✖️ | ✖️ | |
Diginetica | ✔️ | ✖️ | ✔️ | ✖️ | ✖️ | |
Taobao | ✔️ | ✖️ | ✔️ | ✖️ | ✖️ | |
Ali-mobile | ✔️ | ✖️ | ✔️ | ✖️ | ✖️ | |
Retailrocket | ✔️ | ✖️ | ✔️ | ✖️ | ✖️ | |
Amazon Reviews |
We also provide some information about the content of each dataset, such as the number of users and items, to give you a quick overview.
Dataset | #Interactions | #User | #Item | #Rating | #Timestamp |
---|---|---|---|---|---|
MovieLens-100K | 100,000 | 943 | 1,682 | 5 | 49,282 |
MovieLens-1M | 1,000,209 | 6,040 | 3,706 | 5 | 458,455 |
MovieLens-25M | 25,000,095 | 162,541 | 59,047 | 10 | 20,115,267 |
Last.FM | 92,834 | 1,892 | 17,632 | 5,436 | 1 |
Epinions | 664,825 | 40,163 | 139,738 | 5 | 1 |
Tafeng | 464,118 | 9,238 | 7,973 | 1 | 464,118 |
Dunnhumby | 2,595,732 | 2,500 | 92,339 | 1 | 2,595,732 |
Instacart | 33,819,106 | 206,209 | 49,685 | 1 | 3,346,083 |
citeulike-a | 204,986 | 240 | 16,980 | 1 | 1 |
citeulike-t | 134,860 | 216 | 25,584 | 1 | 1 |
HetRec MovieLens | 855,598 | 2,113 | 10,109 | 10 | 809,328 |
HetRec Delicious | 437,593 | 1,867 | 69,223 | 1 | 104,093 |
HetRec LastFM | 186,479 | 1,892 | 12,523 | 1 | 9,749 |
Yelp | 8,021,122 | 1,968,703 | 209,393 | 5 | 7,853,102 |
Gowalla | 6,442,892 | 107,092 | 1,280,969 | 1 | 5,561,957 |
Yoochoose | 1,150,753 | 509,696 | 735 | 1 | 19,949 |
Diginetica | 1,235,380 | 310,324 | 122,993 | 1 | 152 |
Taobao | 3,835,331 | 37,376 | 930,607 | 1 | 698,889 |
Ali-mobile | 12,256,906 | 10,000 | 2,876,947 | 1 | 1 |
Retailrocket | 2,756,101 | 1,407,580 | 235,061 | 1 | 2,749,921 |
Amazon Reviews -- Amazon Instant Video | 583,933 | 426,922 | 23,965 | 5 | 3,027 |
Amazon Reviews -- Musical Instruments | 500,176 | 339,231 | 83,046 | 5 | 5,339 |
Amazon Reviews -- Digital Music | 836,006 | 478,235 | 266,414 | 5 | 5,941 |
Amazon Reviews -- Baby | 915,446 | 531,890 | 64,426 | 5 | 4,869 |
Amazon Reviews -- Grocery and Gourmet Food | 1,297,156 | 768,438 | 166,049 | 5 | 3,831 |
Amazon Reviews -- Patio, Lawn and Garden | 993,490 | 714,791 | 105,984 | 5 | 4,929 |
Amazon Reviews -- Automotive | 1,373,768 | 851,418 | 320,112 | 5 | 3,704 |
Amazon Reviews -- Pet Supplies | 1,235,316 | 740,985 | 103,288 | 5 | 3,900 |
Amazon Reviews -- Cell Phones and Accessories | 3,447,249 | 2,261,045 | 319,678 | 5 | 4,724 |
Amazon Reviews -- Health and Personal Care | 2,982,326 | 1,851,132 | 252,331 | 5 | 4,733 |
Amazon Reviews -- Toys and Games | 2,252,771 | 1,342,911 | 327,698 | 5 | 5,151 |
Amazon Reviews -- Video Games | 1,324,753 | 826,767 | 50,210 | 5 | 5,396 |
Amazon Reviews -- Tools and Home Improvement | 1,926,047 | 1,212,468 | 260,659 | 5 | 5,366 |
Amazon Reviews -- Beauty | 2,023,070 | 1,210,271 | 249,274 | 5 | 4,231 |
Amazon Reviews -- Apps for Android | 2,638,173 | 1,323,884 | 61,275 | 5 | 1,283 |
Amazon Reviews -- Office Products | 1,243,186 | 909,314 | 130,006 | 5 | 5,400 |
Amazon Reviews -- Sports And Outdoors | 3,268,695 | 1,990,521 | 478,898 | 5 | 4,786 |
Amazon Reviews -- Kindle Store | 3,205,467 | 1,406,890 | 1,406,890 | 5 | 3,328 |
Amazon Reviews -- Home And Kitchen | 4,253,926 | 2,511,610 | 410,243 | 5 | 5,202 |
Amazon Reviews -- Clothing Shoes And Jewelry | 5,748,920 | 3,117,268 | 1,136,004 | 5 | 4,209 |
Amazon Reviews -- CDs And Vinyl | 3,749,004 | 1,578,597 | 486,360 | 5 | 6,041 |
Amazon Reviews -- Movies And TV | 4,607,047 | 2,088,620 | 200,941 | 5 | 6,004 |
Amazon Reviews -- Electronics | 7,824,482 | 4,201,696 | 476,002 | 5 | 5,489 |
Amazon Reviews -- Books | 22,507,155 | 8,026,324 | 2,330,066 | 5 | 6,296 |
Dataset Usage¶
Download Data¶
Beta-Recsys provides a download interface for downloading the different datasets. Here is an example:
import sys
import os
sys.path.append(os.path.abspath('.'))
from beta_rec.datasets.movielens import Movielens_1m
movielens_1m = Movielens_1m()
movielens_1m.download()
However, not every dataset can be downloaded directly by our framework. Some datasets still have to be downloaded manually; in that case, follow our tips to download the dataset and put it in the correct folder so that the framework can detect it.
Load Data¶
Downloading and preprocessing very large datasets can be tedious. To address this, we have preprocessed a wide range of datasets and stored the processed data on our remote server. Users can access them easily through our load functions.
import sys
import os
sys.path.append(os.path.abspath('.'))
from beta_rec.datasets.movielens import Movielens_1m
movielens_1m = Movielens_1m()
movielens_1m.load_leave_one_out()
movielens_1m.load_random_split()
Due to storage limitations, we only store a copy of the split data generated with default parameters. If you want a custom split, you will still have to split the data on your local machine.
Make Data¶
Users can generally ignore the make functions: when you pass custom parameters to a load function, it automatically calls the corresponding make function. We strongly recommend using the load functions directly in most cases.
Data Split¶
For users who want to split datasets that are not covered by our framework, we provide several methods that make it easy to split large data without worrying about the implementation details. There are six main methods for splitting data.
random_split¶
This method splits data into random train and test subsets.
It first shuffles all the data and then randomly selects a portion of the records based on the given test_rate.
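As a rough illustration of the shuffle-then-cut idea, the following standalone sketch works on any list of records. It is not the framework's implementation; the `seed` argument is an assumption added for reproducibility.

```python
import random


def random_split(records, test_rate=0.2, seed=0):
    """Shuffle records and hold out a test_rate fraction (illustrative only)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_rate)
    # The first n_test shuffled records form the test set.
    return shuffled[n_test:], shuffled[:n_test]


train, test = random_split(list(range(10)), test_rate=0.2)
```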
random_basket_split¶
This method randomly selects a portion of baskets (one basket may cover more than one record) based on the given test_rate.
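The difference from a record-level random split is that whole baskets move together. A minimal sketch, assuming records are (user, basket, item) tuples; the tuple layout and the `seed` argument are illustrative assumptions, not the framework's data model.

```python
import random


def random_basket_split(records, test_rate=0.2, seed=0):
    """Hold out a random fraction of baskets (illustrative only)."""
    baskets = sorted({basket for _, basket, _ in records})
    random.Random(seed).shuffle(baskets)
    n_test = int(len(baskets) * test_rate)
    test_baskets = set(baskets[:n_test])
    # Every record follows its basket, so no basket is split across sets.
    train = [r for r in records if r[1] not in test_baskets]
    test = [r for r in records if r[1] in test_baskets]
    return train, test


records = [("u", b, i) for b in range(5) for i in ("x", "y")]
train, test = random_basket_split(records, test_rate=0.4)
```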
leave_one_out¶
This method will first rank all the records by time (if a timestamp column is provided), and then select the last record.
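A minimal sketch of this strategy, assuming the last record is taken per user (the usual leave-one-out protocol in recommendation, though the text above does not state it explicitly) and that records are (user, item, timestamp) tuples. It is not the framework's implementation.

```python
from collections import defaultdict


def leave_one_out(records):
    """Hold out each user's most recent record (illustrative only)."""
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec[0]].append(rec)
    train, test = [], []
    for recs in by_user.values():
        recs.sort(key=lambda r: r[2])  # oldest first
        train.extend(recs[:-1])
        test.append(recs[-1])
    return train, test


records = [
    ("u1", "a", 3), ("u1", "b", 1), ("u1", "c", 2),
    ("u2", "a", 5), ("u2", "d", 9),
]
train, test = leave_one_out(records)
```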
leave_one_basket¶
This method provides train/test indices to split data in train/test sets. Each sample is used once as a test set while the remaining samples form the training set.
This method will first rank all the records by time (if a timestamp column is provided), and then select the last basket.
Due to the high number of test sets this method can be very costly.
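Selecting the last basket can be sketched as follows, assuming it is done per user and that records are (user, basket, item, timestamp) tuples. Both assumptions are illustrative; the framework may organise the data differently.

```python
def leave_one_basket(records):
    """Hold out each user's most recent basket (illustrative only)."""
    last_basket = {}  # user -> (latest timestamp, basket id)
    for user, basket, _item, ts in records:
        if user not in last_basket or ts > last_basket[user][0]:
            last_basket[user] = (ts, basket)
    train, test = [], []
    for rec in records:
        is_last = rec[1] == last_basket[rec[0]][1]
        (test if is_last else train).append(rec)
    return train, test


records = [
    ("u1", "b1", "x", 1), ("u1", "b1", "y", 1),
    ("u1", "b2", "z", 2), ("u1", "b2", "w", 2),
    ("u2", "b3", "x", 4),
]
train, test = leave_one_basket(records)
```

In this sketch a user with a single basket ends up with no training records, so in practice such users would typically be filtered out beforehand.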
temporal_split¶
This method will first rank all the records by time (if a timestamp column is provided), and then select the last portion of records.
This splitting approach is for evaluating how well a model performs on segments drawn from the same time series but excluded from the training set.
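The sort-then-cut idea can be sketched in a few lines. The parameter name test_rate is borrowed from the random split methods and is an assumption here, as is the (user, item, timestamp) record layout.

```python
def temporal_split(records, test_rate=0.2):
    """Hold out the most recent test_rate fraction of records (illustrative)."""
    ordered = sorted(records, key=lambda r: r[2])  # oldest first
    n_test = int(len(ordered) * test_rate)
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


records = [("u", i, t) for i, t in enumerate([5, 1, 4, 2, 3, 10, 7, 9, 6, 8])]
train, test = temporal_split(records, test_rate=0.2)
```

Every training record is at least as old as every test record, which is what distinguishes this from a random split.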
temporal_basket_split¶
This method will first rank all the records by time (if a timestamp column is provided), and then select the last portion of baskets.
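A basket-level variant of the same idea, sketched under the assumptions that records are (user, basket, item, timestamp) tuples, that a basket's time is the latest timestamp among its records, and that the held-out fraction is called test_rate. None of these is confirmed as the framework's behaviour.

```python
def temporal_basket_split(records, test_rate=0.2):
    """Hold out the most recent fraction of baskets (illustrative only)."""
    basket_time = {}
    for _user, basket, _item, ts in records:
        basket_time[basket] = max(ts, basket_time.get(basket, ts))
    ordered = sorted(basket_time, key=basket_time.get)  # oldest basket first
    n_test = int(len(ordered) * test_rate)
    test_baskets = set(ordered[len(ordered) - n_test:])
    train = [r for r in records if r[1] not in test_baskets]
    test = [r for r in records if r[1] in test_baskets]
    return train, test


records = [("u1", f"b{t}", item, t) for t in range(1, 6) for item in ("x", "y")]
train, test = temporal_basket_split(records, test_rate=0.4)
```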
Disclaimer on Datasets¶
This is a utility library that downloads and prepares public datasets. We do not host or distribute these datasets, vouch for their quality or fairness, or claim that you have license to use the dataset. It is your responsibility to determine whether you have permission to use the dataset under the dataset’s license.
If you’re a dataset owner and wish to update any part of it (description, citation, etc.), or do not want your dataset to be included in this library, please get in touch through a GitHub issue. Thanks for your contribution to the RecSys community!
More¶
For any questions, please let us know by creating an issue or by sending an email to recsys.beta@gmail.com. We will try to respond as soon as possible.