beta_rec.utils package¶
beta_rec.utils.alias_table module¶
-
class beta_rec.utils.alias_table.AliasTable(obj_freq)[source]¶ Bases: object
AliasTable Class.
A list of indices of tokens in the vocab following a power-law distribution, used to draw negative samples.
-
sample(count, obj_num=1, no_repeat=False)[source]¶ Generate samples.
Parameters: - count – the number of tokens in a draw.
- obj_num – the number of draws.
- no_repeat – whether repeated tokens are allowed in a single draw.
Returns: A list of tokens.
Raises: ValueError – if count is larger than vocab_size when no_repeat is True.
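The class description suggests a word2vec-style table for drawing negative samples from a smoothed frequency distribution. A minimal pure-Python sketch of that idea (not the library's alias-method implementation; `build_table`, the `0.75` smoothing power, and the `table_size` parameter are illustrative assumptions):

```python
import random

def build_table(obj_freq, power=0.75, table_size=10000):
    # Build a sampling table where each token index appears in proportion
    # to its frequency raised to `power` (the classic word2vec smoothing).
    weights = [f ** power for f in obj_freq]
    total = sum(weights)
    table = []
    for idx, w in enumerate(weights):
        table.extend([idx] * max(1, int(round(w / total * table_size))))
    return table

def sample(table, count, no_repeat=False, vocab_size=None):
    # Draw `count` token indices from the table; optionally forbid repeats.
    if no_repeat:
        if vocab_size is not None and count > vocab_size:
            raise ValueError("count is larger than vocab_size")
        drawn = set()
        while len(drawn) < count:
            drawn.add(random.choice(table))
        return list(drawn)
    return [random.choice(table) for _ in range(count)]
```

Frequent tokens dominate the table, so `random.choice` over it approximates drawing from the smoothed power-law distribution.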
-
beta_rec.utils.common_util module¶
-
class beta_rec.utils.common_util.DictToObject(dictionary)[source]¶ Bases: object
Convert a Python dict to an object with attribute access.
-
beta_rec.utils.common_util.ensureDir(dir_path)[source]¶ Ensure a directory exists; otherwise create the path.
Parameters: dir_path (str) – the target directory.
-
beta_rec.utils.common_util.get_data_frame_from_gzip_file(path)[source]¶ Get a DataFrame from a gzip file.
Parameters: path – the file path of the gzip file.
Returns: A DataFrame extracted from the gzip file.
-
beta_rec.utils.common_util.get_dataframe_from_npz(data_file)[source]¶ Get a DataFrame from an npz file.
Parameters: data_file (str or Path) – File path.
Returns: the decompressed data.
Return type: DataFrame
-
beta_rec.utils.common_util.get_random_rep(raw_num, dim)[source]¶ Generate random embeddings from a normal (Gaussian) distribution.
Parameters: - raw_num – number of rows (embeddings) to be generated.
- dim – the dimension of the embeddings.
Returns: ndarray or scalar. Drawn samples from the normal distribution.
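A minimal sketch of what this helper describes, assuming standard NumPy; the `seed` parameter is an illustrative addition for reproducibility, not part of the documented signature:

```python
import numpy as np

def random_embeddings(row_num, dim, seed=None):
    # Draw a (row_num, dim) embedding matrix from a standard
    # normal (Gaussian) distribution.
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=1.0, size=(row_num, dim))
```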
-
beta_rec.utils.common_util.normalized_adj_single(adj)[source]¶ Missing docs.
Parameters: adj – (undocumented).
Returns: None.
-
beta_rec.utils.common_util.parse_gzip_file(path)[source]¶ Parse a gzip file.
Parameters: path – the file path of the gzip file.
-
beta_rec.utils.common_util.print_dict_as_table(dic, tag=None, columns=['keys', 'values'])[source]¶ Print a dictionary as a table.
Parameters: - dic (dict) – dict object to be formatted.
- tag (str) – a name for this dictionary.
- columns ([str, str]) – column names for keys and values; default ["keys", "values"].
Returns: None
-
beta_rec.utils.common_util.save_dataframe_as_npz(data, data_file)[source]¶ Save a DataFrame in compressed npz format.
Parameters: - data (DataFrame) – DataFrame to be saved.
- data_file – target file path.
-
beta_rec.utils.common_util.save_to_csv(result, result_file)[source]¶ Save a result dict to disk.
Parameters: - result – the result dict to be saved.
- result_file – the target file path.
-
beta_rec.utils.common_util.set_seed(seed)[source]¶ Initialize all the random seeds in the system.
Parameters: seed – a global random seed.
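A minimal sketch of such a seeding helper, covering only Python's standard library; the real implementation presumably also seeds numpy and torch, which is omitted here:

```python
import os
import random

def set_seed(seed):
    # Fix hash randomization and the stdlib RNG so repeated runs
    # produce the same random draws.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
```

Resetting the same seed makes subsequent draws reproducible, which is the point of calling this once at experiment start.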
-
beta_rec.utils.common_util.timeit(method)[source]¶ Generate a decorator that tracks the execution time of the given method.
Parameters: method – the method to be timed.
To use:
@timeit
def method(self):
    pass
Returns: the wrapped method.
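A sketch of how such a timing decorator is commonly written (the printed message format is an assumption, not the library's output):

```python
import time
from functools import wraps

def timeit(method):
    # Decorator that reports how long the wrapped method took to run.
    @wraps(method)
    def timed(*args, **kwargs):
        start = time.perf_counter()
        result = method(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{method.__name__} took {elapsed:.4f}s")
        return result
    return timed

@timeit
def slow_add(x, y):
    time.sleep(0.01)
    return x + y
```

`functools.wraps` preserves the wrapped method's name and docstring, so the decorated function still introspects correctly.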
-
beta_rec.utils.common_util.un_zip(file_name, target_dir=None)[source]¶ Unzip zip files.
Parameters: - file_name (str or Path) – zip file path.
- target_dir (str or Path) – target path where the unzipped files are saved.
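A minimal sketch with the standard-library `zipfile` module; defaulting `target_dir` to a directory named after the archive is an assumption:

```python
import zipfile
from pathlib import Path

def un_zip(file_name, target_dir=None):
    # Extract a zip archive; if no target is given, extract into a
    # directory named after the archive (suffix stripped).
    file_name = Path(file_name)
    target_dir = Path(target_dir) if target_dir else file_name.with_suffix("")
    with zipfile.ZipFile(file_name) as zf:
        zf.extractall(target_dir)
    return target_dir
```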
-
beta_rec.utils.common_util.update_args(config, args)[source]¶ Update config parameters with the parameters received from the command line.
Parameters: - config (dict) – initial dict of the parameters from the JSON config file.
- args (object) – an argparse Argument object whose attributes are the parameters to be updated.
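A sketch of the typical pattern: only command-line flags the user actually supplied (here signalled by a non-None default, an assumption of this sketch) override the JSON config:

```python
import argparse

def update_args(config, args):
    # Overwrite config entries with any argparse attribute that was set;
    # None is treated as "not provided on the command line".
    for key, value in vars(args).items():
        if value is not None:
            config[key] = value
    return config

parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=None)
parser.add_argument("--epochs", type=int, default=None)
```

For example, `update_args({"lr": 0.01, "epochs": 10}, parser.parse_args(["--lr", "0.1"]))` overrides only `lr` and leaves `epochs` at its config value.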
beta_rec.utils.constants module¶
beta_rec.utils.download module¶
-
beta_rec.utils.download.download_file(url, store_file_path)[source]¶ Download the raw dataset file.
Download the dataset from the given url and save it to store_file_path.
Parameters: - url – the url from which the dataset file can be downloaded.
- store_file_path – the path where the downloaded file is stored.
Returns: the archive format inferred from the file suffix.
beta_rec.utils.evaluation module¶
-
class beta_rec.utils.evaluation.PandasHash(pandas_object)[source]¶ Bases: object
Wrapper class that allows pandas objects (DataFrames or Series) to be hashable.
-
pandas_object¶
-
beta_rec.utils.evaluation.auc(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction')[source]¶ Calculate the Area-Under-Curve metric.
Calculate the Area-Under-Curve metric for implicit-feedback recommenders, where the rating is binary and the prediction is a float ranging from 0 to 1.
https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve
Note
The evaluation does not require a leave-one-out scenario. This metric does not calculate group-based AUC, which averages the AUC scores across users, and it is not limited to k. Instead, it calculates the score on the entire prediction results regardless of the users.
Parameters: - rating_true (pd.DataFrame) – True data.
- rating_pred (pd.DataFrame) – Predicted data.
- col_user (str) – column name for user.
- col_item (str) – column name for item.
- col_rating (str) – column name for rating.
- col_prediction (str) – column name for prediction.
Returns: auc_score (min=0, max=1).
Return type: float
-
beta_rec.utils.evaluation.check_column_dtypes(func)[source]¶ Check columns of DataFrame inputs.
This includes checks on:
- whether the input columns exist in the input DataFrames;
- whether the data types of col_user and col_item match across the two input DataFrames.
Parameters: func (function) – function that will be wrapped.
-
beta_rec.utils.evaluation.exp_var(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction')[source]¶ Calculate explained variance.
Parameters: - rating_true (pd.DataFrame) – True data. There should be no duplicate (userID, itemID) pairs.
- rating_pred (pd.DataFrame) – Predicted data. There should be no duplicate (userID, itemID) pairs.
- col_user (str) – column name for user.
- col_item (str) – column name for item.
- col_rating (str) – column name for rating.
- col_prediction (str) – column name for prediction.
Returns: Explained variance (min=0, max=1).
Return type: float
-
beta_rec.utils.evaluation.get_top_k_items(dataframe, col_user='col_user', col_rating='col_rating', k=10)[source]¶ Get the top k items for each user.
Take the input customer-item-rating tuples as a pandas DataFrame and output a pandas DataFrame of the top k items for each user, in dense format.
Note
If the ratings are implicit, just append a column of constants as the ratings.
Parameters: - dataframe (pandas.DataFrame) – DataFrame of rating data (in the format customerID-itemID-rating).
- col_user (str) – column name for user.
- col_rating (str) – column name for rating.
- k (int) – number of items for each user.
Returns: DataFrame of top k items for each user.
Return type: pd.DataFrame
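The core of this operation is a sort-then-group in pandas. A minimal sketch (not the library's implementation; ties and index handling may differ):

```python
import pandas as pd

def top_k_items(dataframe, col_user="col_user", col_rating="col_rating", k=10):
    # Sort by rating descending, then keep the first k rows of each user group.
    return (
        dataframe.sort_values(col_rating, ascending=False)
        .groupby(col_user)
        .head(k)
        .reset_index(drop=True)
    )
```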
-
beta_rec.utils.evaluation.has_columns(df, columns)[source]¶ Check if a DataFrame has the necessary columns.
Parameters: - df (pd.DataFrame) – DataFrame.
- columns (list(str)) – columns to check for.
Returns: True if the DataFrame has the specified columns.
Return type: bool
-
beta_rec.utils.evaluation.has_same_base_dtype(df_1, df_2, columns=None)[source]¶ Check if specified columns have the same base dtypes across both DataFrames.
Parameters: - df_1 (pd.DataFrame) – first DataFrame.
- df_2 (pd.DataFrame) – second DataFrame.
- columns (list(str)) – columns to check, None checks all columns.
Returns: True if DataFrames columns have the same base dtypes.
Return type: bool
-
beta_rec.utils.evaluation.logloss(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction')[source]¶ Calculate the logloss metric.
Calculate the logloss metric for implicit-feedback recommenders, where the rating is binary and the prediction is a float ranging from 0 to 1.
https://en.wikipedia.org/wiki/Loss_functions_for_classification#Cross_entropy_loss_(Log_Loss)
Parameters: - rating_true (pd.DataFrame) – True data.
- rating_pred (pd.DataFrame) – Predicted data.
- col_user (str) – column name for user.
- col_item (str) – column name for item.
- col_rating (str) – column name for rating.
- col_prediction (str) – column name for prediction.
Returns: log_loss_score (min=0, max=inf).
Return type: float
-
beta_rec.utils.evaluation.lru_cache_df(maxsize, typed=False)[source]¶ Least-recently-used cache decorator.
Parameters: - maxsize (int|None) – max size of cache, if set to None cache is boundless.
- typed (bool) – arguments of different types are cached separately.
-
beta_rec.utils.evaluation.mae(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction')[source]¶ Calculate Mean Absolute Error.
Parameters: - rating_true (pd.DataFrame) – True data. There should be no duplicate (userID, itemID) pairs.
- rating_pred (pd.DataFrame) – Predicted data. There should be no duplicate (userID, itemID) pairs.
- col_user (str) – column name for user.
- col_item (str) – column name for item.
- col_rating (str) – column name for rating.
- col_prediction (str) – column name for prediction.
Returns: Mean Absolute Error.
Return type: float
-
beta_rec.utils.evaluation.map_at_k(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction', relevancy_method='top_k', k=10, threshold=10)[source]¶ Mean Average Precision at k.
The implementation of MAP is referenced from Spark MLlib evaluation metrics. https://spark.apache.org/docs/2.3.0/mllib-evaluation-metrics.html#ranking-systems
A good reference can be found at: http://web.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf
Note
1. The evaluation function is named 'MAP at k' because the evaluation takes the top k items of the predictions. The naming differs from Spark.
2. MAP calculates the average precision over the relevant items, so it is normalized by the number of relevant items in the ground-truth data instead of by k.
Parameters: - rating_true (pd.DataFrame) – True DataFrame.
- rating_pred (pd.DataFrame) – Predicted DataFrame.
- col_user (str) – column name for user.
- col_item (str) – column name for item.
- col_rating (str) – column name for rating.
- col_prediction (str) – column name for prediction.
- relevancy_method (str) – method for determining relevancy [‘top_k’, ‘by_threshold’].
- k (int) – number of top k items per user.
- threshold (float) – threshold of top items per user (optional).
Returns: MAP at k (min=0, max=1).
Return type: float
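The normalization described in the note above can be made concrete with a small sketch over per-user id lists (a simplification of the DataFrame-based API; `truth` and `predictions` as dicts are illustrative assumptions):

```python
def average_precision_at_k(relevant, ranked, k=10):
    # Accumulate precision at each rank where a hit occurs, then normalize
    # by the number of relevant items (not by k), as the note describes.
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

def map_at_k(truth, predictions, k=10):
    # Mean of the per-user average precision.
    return sum(
        average_precision_at_k(truth[u], predictions[u], k) for u in truth
    ) / len(truth)
```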
-
beta_rec.utils.evaluation.merge_ranking_true_pred(rating_true, rating_pred, col_user, col_item, col_rating, col_prediction, relevancy_method, k=10, threshold=10)[source]¶ Filter truth and prediction data frames on common users.
Parameters: - rating_true (pd.DataFrame) – True DataFrame.
- rating_pred (pd.DataFrame) – Predicted DataFrame.
- col_user (str) – column name for user.
- col_item (str) – column name for item.
- col_rating (str) – column name for rating.
- col_prediction (str) – column name for prediction.
- relevancy_method (str) – method for determining relevancy [‘top_k’, ‘by_threshold’].
- k (int) – number of top k items per user (optional).
- threshold (float) – threshold of top items per user (optional).
Returns: DataFrame of recommendation hits, DataFrame of hit counts vs. actual relevant items per user, and the number of unique user ids.
Return type: pd.DataFrame, pd.DataFrame, int
-
beta_rec.utils.evaluation.merge_rating_true_pred(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction')[source]¶ Join truth and prediction data frames on userID and itemID.
Join truth and prediction DataFrames on userID and itemID and return the true and predicted ratings with the correct index.
Parameters: - rating_true (pd.DataFrame) – True data.
- rating_pred (pd.DataFrame) – Predicted data.
- col_user (str) – column name for user.
- col_item (str) – column name for item.
- col_rating (str) – column name for rating.
- col_prediction (str) – column name for prediction.
Returns: Array with the true ratings, and array with the predicted ratings.
Return type: np.array, np.array
-
beta_rec.utils.evaluation.ndcg_at_k(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction', relevancy_method='top_k', k=10, threshold=10)[source]¶ Compute Normalized Discounted Cumulative Gain (nDCG).
Info: https://en.wikipedia.org/wiki/Discounted_cumulative_gain
Parameters: - rating_true (pd.DataFrame) – True DataFrame.
- rating_pred (pd.DataFrame) – Predicted DataFrame.
- col_user (str) – column name for user.
- col_item (str) – column name for item.
- col_rating (str) – column name for rating.
- col_prediction (str) – column name for prediction.
- relevancy_method (str) – method for determining relevancy [‘top_k’, ‘by_threshold’].
- k (int) – number of top k items per user.
- threshold (float) – threshold of top items per user (optional).
Returns: nDCG at k (min=0, max=1).
Return type: float
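A sketch of nDCG@k with binary relevance over id lists (a simplification of the DataFrame-based API; the log2 discount follows the linked Wikipedia definition):

```python
import math

def ndcg_at_k(relevant, ranked, k=10):
    # DCG over the top-k ranked list with binary gains, divided by the
    # ideal DCG (all relevant items ranked first).
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal > 0 else 0.0
```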
-
beta_rec.utils.evaluation.precision_at_k(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction', relevancy_method='top_k', k=10, threshold=10)[source]¶ Precision at K.
Note: We use the same formula to calculate precision@k as that in Spark. More details can be found at http://spark.apache.org/docs/2.1.1/api/python/pyspark.mllib.html#pyspark.mllib.evaluation.RankingMetrics.precisionAt In particular, the maximum achievable precision may be less than 1 if the number of items for a user in rating_pred is less than k.
Parameters: - rating_true (pd.DataFrame) – True DataFrame.
- rating_pred (pd.DataFrame) – Predicted DataFrame.
- col_user (str) – column name for user.
- col_item (str) – column name for item.
- col_rating (str) – column name for rating.
- col_prediction (str) – column name for prediction.
- relevancy_method (str) – method for determining relevancy [‘top_k’, ‘by_threshold’].
- k (int) – number of top k items per user.
- threshold (float) – threshold of top items per user (optional).
Returns: precision at k (min=0, max=1).
Return type: float
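The Spark-style formula from the note above, sketched over id lists (a simplification of the DataFrame-based API): the denominator is always k, which is why the maximum achievable precision can fall below 1.

```python
def precision_at_k(relevant, ranked, k=10):
    # Fraction of the top-k recommendations that are relevant;
    # dividing by k (not by the list length) matches the Spark convention.
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k
```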
-
beta_rec.utils.evaluation.recall_at_k(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction', relevancy_method='top_k', k=10, threshold=10)[source]¶ Recall at K.
Parameters: - rating_true (pd.DataFrame) – True DataFrame.
- rating_pred (pd.DataFrame) – Predicted DataFrame.
- col_user (str) – column name for user.
- col_item (str) – column name for item.
- col_rating (str) – column name for rating.
- col_prediction (str) – column name for prediction.
- relevancy_method (str) – method for determining relevancy [‘top_k’, ‘by_threshold’].
- k (int) – number of top k items per user.
- threshold (float) – threshold of top items per user (optional).
Returns: recall at k (min=0, max=1). The maximum value is 1 even when fewer than k items exist for a user in rating_true.
Return type: float
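The companion recall sketch over id lists (a simplification of the DataFrame-based API): the denominator is the number of relevant items, so the maximum is 1 regardless of k.

```python
def recall_at_k(relevant, ranked, k=10):
    # Fraction of the relevant items that appear in the top-k recommendations.
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0
```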
-
beta_rec.utils.evaluation.rmse(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction')[source]¶ Calculate Root Mean Squared Error.
Parameters: - rating_true (pd.DataFrame) – True data. There should be no duplicate (userID, itemID) pairs.
- rating_pred (pd.DataFrame) – Predicted data. There should be no duplicate (userID, itemID) pairs.
- col_user (str) – column name for user.
- col_item (str) – column name for item.
- col_rating (str) – column name for rating.
- col_prediction (str) – column name for prediction.
Returns: Root mean squared error.
Return type: float
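A sketch of the RMSE computation over aligned rating lists (the DataFrame-based API would first join the true and predicted frames, as merge_rating_true_pred does):

```python
import math

def rmse(y_true, y_pred):
    # Root of the mean squared difference between true and predicted ratings.
    assert len(y_true) == len(y_pred)
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )
```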
-
beta_rec.utils.evaluation.rsquared(rating_true, rating_pred, col_user='col_user', col_item='col_item', col_rating='col_rating', col_prediction='col_prediction')[source]¶ Calculate R squared.
Parameters: - rating_true (pd.DataFrame) – True data. There should be no duplicate (userID, itemID) pairs.
- rating_pred (pd.DataFrame) – Predicted data. There should be no duplicate (userID, itemID) pairs.
- col_user (str) – column name for user.
- col_item (str) – column name for item.
- col_rating (str) – column name for rating.
- col_prediction (str) – column name for prediction.
Returns: R squared (min=0, max=1).
Return type: float
beta_rec.utils.logger module¶
-
class beta_rec.utils.logger.Logger(filename='default', stdout=None, stderr=None)[source]¶ Bases: object
Logger Class.
beta_rec.utils.monitor module¶
beta_rec.utils.onedrive module¶
-
class beta_rec.utils.onedrive.OneDrive(url=None, path=None)[source]¶ Bases: object
Download a shared file/folder to localhost with the directory structure preserved.
Download a shared file/folder from OneDrive without authentication.
Parameters: - url (str) – url of the shared OneDrive folder or file.
- path (str) – local filesystem path.
Methods: download() -> None: fire an async download of all files found at the URL.
beta_rec.utils.seq_evaluation module¶
-
beta_rec.utils.seq_evaluation.count_a_in_b_unique(a, b)[source]¶ Count the unique items of a that occur in b.
Parameters: - a (List) – list of lists.
- b (List) – list of lists.
Returns: number of elements of a in b.
Return type: count (int)
-
beta_rec.utils.seq_evaluation.mrr(ground_truth, prediction)[source]¶ Compute the Mean Reciprocal Rank metric. The Reciprocal Rank is set to 0 if no predicted item is contained in the ground truth.
Parameters: - ground_truth (List) – the ground truth set or sequence
- prediction (List) – the predicted set or sequence
Returns: the value of the metric
Return type: rr (float)
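A sketch of the reciprocal-rank computation for a single sequence, matching the description above (the mean over sequences would average these values):

```python
def mrr(ground_truth, prediction):
    # Reciprocal of the rank of the first predicted item found in the
    # ground truth; 0.0 if there is no hit at all.
    for rank, item in enumerate(prediction, start=1):
        if item in ground_truth:
            return 1.0 / rank
    return 0.0
```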
-
beta_rec.utils.seq_evaluation.ndcg(ground_truth, prediction)[source]¶ Compute the Normalized Discounted Cumulative Gain (NDCG) metric.
Parameters: - ground_truth (List) – the ground truth set or sequence.
- prediction (List) – the predicted set or sequence.
Returns: the value of the metric.
Return type: ndcg (float)
-
beta_rec.utils.seq_evaluation.precision(ground_truth, prediction)[source]¶ Compute the Precision metric.
Parameters: - ground_truth (List) – the ground truth set or sequence
- prediction (List) – the predicted set or sequence
Returns: the value of the metric
Return type: precision_score (float)
beta_rec.utils.triple_sampler module¶
beta_rec.utils.unigram_table module¶
Module contents¶
Utils Module.