Matrix Factorization
Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. These algorithms work by decomposing the user-item interaction matrix into the product of two lower-dimensional rectangular matrices: the rows of the first matrix represent the latent user factors and the columns of the second matrix represent the latent item factors, so their product approximates the original user-item interaction matrix (each predicted rating is the dot product of a user factor vector and an item factor vector). The latent factors, also known as embeddings, typically have much lower dimensionality than the original user and item vectors. They are learned through an iterative process that minimizes the error between the reconstructed ratings and the observed user-item interactions; the error is measured with a loss function such as mean squared error (MSE) or binary cross entropy (BCE) and minimized using gradient descent or one of its variants.
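To make this concrete, here is a minimal NumPy sketch of the idea (not the Surprise code used later in this section): it factorizes a tiny, made-up ratings matrix into user and item factors by stochastic gradient descent on the squared error of the observed entries. The matrix, the number of factors, the learning rate, and the regularization strength are all illustrative choices.

import numpy as np

# Toy user-item ratings matrix (0 = unobserved); the values are made up.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)

n_users, n_items = R.shape
k = 2                    # number of latent factors
lr, reg = 0.01, 0.02     # learning rate and L2 regularization (example values)
rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))   # user factors (embeddings)
Q = rng.normal(scale=0.1, size=(n_items, k))   # item factors (embeddings)

observed = [(u, i) for u in range(n_users) for i in range(n_items) if R[u, i] > 0]

for epoch in range(200):
    for u, i in observed:
        pu, qi = P[u].copy(), Q[i].copy()
        err = R[u, i] - pu @ qi                 # error on one observed rating
        P[u] += lr * (err * qi - reg * pu)      # SGD step on the user factors
        Q[i] += lr * (err * pu - reg * qi)      # SGD step on the item factors

print(np.round(P @ Q.T, 2))   # the reconstruction approximates the observed entries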
Singular Value Decomposition (SVD)
The singular value decomposition comes from linear algebra, and it is a way of breaking a matrix down into constituent parts: we can factorize it into three matrices. This is called a factorization because it works much like factoring numbers: 15 can be factored into 3 and 5, such that multiplying 3 and 5 together gives back 15.
SVD factorizes the ratings matrix as \(R = P \Sigma Q^\top\), where:
\(R\) is the \(m \times n\) ratings matrix
\(P\) is the \(m \times k\) user-feature affinity matrix
\(Q\) is the \(n \times k\) item-feature relevance matrix
\(\Sigma\) is the \(k \times k\) diagonal feature weight matrix
For linear algebra people: \(P\) and \(Q\) have orthonormal columns
Linear algebra guarantees this decomposition exists for any real \(R\)
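A minimal NumPy sketch of this decomposition on a small, made-up ratings matrix, keeping only the top k = 2 singular values to obtain a low-rank approximation:

import numpy as np

R = np.array([
    [5, 3, 4, 1],
    [4, 2, 4, 1],
    [1, 1, 2, 5],
    [2, 1, 3, 4],
], dtype=float)            # made-up dense ratings matrix

# full_matrices=False gives the "economy" SVD: R = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2                      # keep only the two strongest latent features
P = U[:, :k]               # user-feature affinity (m x k)
Sigma = np.diag(s[:k])     # feature weights (k x k)
Q = Vt[:k, :].T            # item-feature relevance (n x k)

R_approx = P @ Sigma @ Q.T
print(np.round(R_approx, 2))   # rank-2 approximation of R

Note that a real ratings matrix is mostly missing entries, so recommender systems learn the factors only from observed ratings (as in the gradient-descent sketch above) rather than computing a full SVD of a dense matrix.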
Latent features
Latent means not directly observable. The common use of the term in PCA and Factor Analysis is to reduce the dimensionality of a large number of directly observable features to a smaller set of indirectly observable features.
SVD describes preference in terms of latent features
These features are learned from the rating data
Not necessarily interpretable
Optimized for predictive power
Defines a shared vector space for users and items (the feature space)
Enables a compact representation of each user and item
Example using the Surprise library
import pandas as pd
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV
Importing data
GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.org). The data sets were collected over various periods of time, depending on the size of the set.
We are using the Small dataset: 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.
Download: ml-latest-small.zip (size: 1 MB)
df = pd.read_csv("https://raw.githubusercontent.com/singhsidhukuldeep/Recommendation-System/master/data/ratings.csv")
df.head()
|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |
df.tail()
|   | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 100831 | 610 | 166534 | 4.0 | 1493848402 |
| 100832 | 610 | 168248 | 5.0 | 1493850091 |
| 100833 | 610 | 168250 | 5.0 | 1494273047 |
| 100834 | 610 | 168252 | 5.0 | 1493846352 |
| 100835 | 610 | 170875 | 3.0 | 1493846415 |
df.drop(['timestamp'], axis=1, inplace=True)
df.columns = ['userID', 'item', 'rating']
df.head()
|   | userID | item | rating |
|---|---|---|---|
| 0 | 1 | 1 | 4.0 |
| 1 | 1 | 3 | 4.0 |
| 2 | 1 | 6 | 4.0 |
| 3 | 1 | 47 | 5.0 |
| 4 | 1 | 50 | 5.0 |
df.shape
(100836, 3)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 userID 100836 non-null int64
1 item 100836 non-null int64
2 rating 100836 non-null float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB
print('Dataset shape: {}'.format(df.shape))
print('-Dataset examples-')
print(df.iloc[::20000, :])
Dataset shape: (100836, 3)
-Dataset examples-
userID item rating
0 1 1 4.0
20000 132 1079 3.5
40000 274 5621 2.0
60000 387 6748 3.0
80000 501 11 3.0
100000 610 6978 4.0
# To reduce the size and sparsity of the dataset, we filter out rarely rated movies and users who rate rarely.
min_ratings = 5
filter_items = df['item'].value_counts() > min_ratings
filter_items = filter_items[filter_items].index.tolist()
min_user_ratings = 5
filter_users = df['userID'].value_counts() > min_user_ratings
filter_users = filter_users[filter_users].index.tolist()
df_new = df[(df['item'].isin(filter_items)) & (df['userID'].isin(filter_users))]
print('The original data frame shape:\t{}'.format(df.shape))
print('The new data frame shape:\t{}'.format(df_new.shape))
The original data frame shape: (100836, 3)
The new data frame shape: (88364, 3)
Surprise library
To load a dataset from a pandas dataframe, we use the load_from_df() method. We also need a Reader object, and the rating_scale parameter must be specified. The dataframe must have three columns, corresponding to the user ids, the item ids, and the ratings, in that order. Each row thus corresponds to a single rating.
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df_new[['userID', 'item', 'rating']], reader)
Basic algorithms
With the Surprise library, we will benchmark the following algorithms:
NormalPredictor
The NormalPredictor algorithm predicts a random rating based on the distribution of the training set, which is assumed to be normal. It is one of the most basic algorithms and does very little work.
BaselineOnly
The BaselineOnly algorithm predicts the baseline estimate for a given user and item.
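As described in the Surprise documentation, the baseline estimate has the form \(b_{ui} = \mu + b_u + b_i\), where \(\mu\) is the global mean rating and \(b_u\), \(b_i\) are the learned user and item biases.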
k-NN algorithms
KNNBasic
KNNBasic is a basic collaborative filtering algorithm.
KNNWithMeans
KNNWithMeans is a basic collaborative filtering algorithm that takes into account the mean ratings of each user.
KNNWithZScore
KNNWithZScore is a basic collaborative filtering algorithm, taking into account the z-score normalization of each user.
KNNBaseline
KNNBaseline is a basic collaborative filtering algorithm taking into account a baseline rating.
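All of these k-NN variants accept the same similarity configuration through sim_options. A minimal sketch (the 'cosine' similarity and k=40 below are just example choices, not recommendations):

# Example configuration for any of the k-NN algorithms above.
sim_options = {
    "name": "cosine",      # similarity measure: 'cosine', 'msd', or 'pearson'
    "user_based": True,    # True = user-user CF, False = item-item CF
}
knn = KNNWithMeans(k=40, sim_options=sim_options)
cross_validate(knn, data, measures=["RMSE"], cv=3, verbose=True)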
Matrix Factorization-based algorithms
SVD
The SVD algorithm is equivalent to Probabilistic Matrix Factorization (http://papers.nips.cc/paper/3208-probabilistic-matrix-factorization.pdf).
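In Surprise's implementation, the predicted rating is \(\hat{r}_{ui} = \mu + b_u + b_i + q_i^\top p_u\), where \(p_u\) and \(q_i\) are the user and item latent factor vectors and \(b_u\), \(b_i\) are bias terms, all learned by minimizing the regularized squared error on the known ratings with stochastic gradient descent. When the bias terms are dropped, this reduces to Probabilistic Matrix Factorization.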
SVDpp
The SVDpp algorithm is an extension of SVD that takes into account implicit ratings.
NMF
NMF is a collaborative filtering algorithm based on Non-negative Matrix Factorization. It is very similar to SVD.
Slope One
Slope One is a straightforward implementation of the SlopeOne algorithm. (https://arxiv.org/abs/cs/0702144)
Co-clustering
Co-clustering is a collaborative filtering algorithm based on co-clustering (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.113.6458&rep=rep1&type=pdf)
We use RMSE as the accuracy metric for the predictions.
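RMSE is the root of the mean squared prediction error over the test set \(T\): \(\mathrm{RMSE} = \sqrt{\tfrac{1}{|T|}\sum_{(u,i)\in T}(r_{ui} - \hat{r}_{ui})^2}\), so lower is better.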
benchmark = []
# Iterate over all algorithms
# algorithms = [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]
algorithms = [SVD(), KNNWithMeans(), CoClustering()]
# print ("Attempting: ", str(algorithms), '\n\n\n')
for algorithm in algorithms:
    # print("Starting: ", str(algorithm))
    print("Starting: ", str(algorithm).split(' ')[0].split('.')[-1])
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=True)
    # results = cross_validate(algorithm, data, measures=['RMSE','MAE'], cv=3, verbose=False)
    # Get mean results across folds & append the algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    benchmark.append(tmp)
    print("Done: ", str(algorithm), "\n\n")
print ('\n\tDONE\n')
Starting: SVD
Evaluating RMSE of algorithm SVD on 3 split(s).
Fold 1 Fold 2 Fold 3 Mean Std
RMSE (testset) 0.8575 0.8708 0.8664 0.8649 0.0055
Fit time 0.64 0.65 0.63 0.64 0.01
Test time 0.19 0.19 0.19 0.19 0.00
Done: <surprise.prediction_algorithms.matrix_factorization.SVD object at 0x7f287e4e78e0>
Starting: KNNWithMeans
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNWithMeans on 3 split(s).
Fold 1 Fold 2 Fold 3 Mean Std
RMSE (testset) 0.8616 0.8734 0.8736 0.8695 0.0056
Fit time 0.07 0.09 0.08 0.08 0.01
Test time 1.66 1.66 1.66 1.66 0.00
Done: <surprise.prediction_algorithms.knns.KNNWithMeans object at 0x7f287e4e63b0>
Starting: CoClustering
Evaluating RMSE of algorithm CoClustering on 3 split(s).
Fold 1 Fold 2 Fold 3 Mean Std
RMSE (testset) 0.9178 0.9188 0.9192 0.9186 0.0006
Fit time 1.06 1.08 1.09 1.08 0.01
Test time 0.21 0.14 0.21 0.19 0.03
Done: <surprise.prediction_algorithms.co_clustering.CoClustering object at 0x7f287e4e6320>
DONE
surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')
surprise_results
| Algorithm | test_rmse | fit_time | test_time |
|---|---|---|---|
| SVD | 0.864887 | 0.639061 | 0.187395 |
| KNNWithMeans | 0.869510 | 0.081105 | 1.659812 |
| CoClustering | 0.918592 | 1.075265 | 0.187301 |
SVDpp performs best, but it takes a lot of time, so we will use SVD instead and tune it with GridSearchCV.
# param_grid = {
# "n_epochs": [5, 10, 15, 20, 30, 40, 50, 100],
# "lr_all": [0.001, 0.002, 0.005],
# "reg_all": [0.02, 0.08, 0.4, 0.6]
# }
# smaller grid for testing
param_grid = {
    "n_epochs": [10, 20],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.02]
}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], refit=True, cv=5)
gs.fit(data)
training_parameters = gs.best_params["rmse"]
print("BEST RMSE: \t", gs.best_score["rmse"])
print("BEST MAE: \t", gs.best_score["mae"])
print("BEST params: \t", gs.best_params["rmse"])
BEST RMSE: 0.855942316679972
BEST MAE: 0.6564051163054095
BEST params: {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}
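Because refit=True, GridSearchCV also refits the algorithm with the best parameters on the full dataset, so it could be used directly instead of retraining by hand. A sketch, assuming Surprise's best_estimator attribute keyed by measure (the user and item ids are just example values):

best_algo = gs.best_estimator["rmse"]        # SVD with the best RMSE params, refit on the full data
print(best_algo.predict(610, 10, verbose=True))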
from datetime import datetime
print(training_parameters)
reader = Reader(rating_scale=(1, 5))
print("\n\n\t\t STARTING\n\n")
start = datetime.now()
print("> Loading data...")
data = Dataset.load_from_df(df_new[['userID', 'item', 'rating']], reader)
print("> OK")
print("> Creating trainset...")
trainset = data.build_full_trainset()
print("> OK")
startTraining = datetime.now()
print("> Training...")
algo = SVD(n_epochs = training_parameters['n_epochs'], lr_all = training_parameters['lr_all'], reg_all = training_parameters['reg_all'])
algo.fit(trainset)
endTraining = datetime.now()
print("> OK \t\t It Took: ", (endTraining-startTraining).seconds, "seconds")
end = datetime.now()
print (">> DONE \t\t It Took", (end-start).seconds, "seconds" )
{'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}
STARTING
> Loading data...
> OK
> Creating trainset...
> OK
> Training...
> OK It Took: 0 seconds
>> DONE It Took 0 seconds
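The fitted model now holds the learned embeddings; a quick way to inspect them, using pu, qi, bu, and bi, which are the attribute names Surprise's SVD exposes for the factor matrices and biases:

print(algo.pu.shape)                  # user latent factor matrix: (n_users, n_factors)
print(algo.qi.shape)                  # item latent factor matrix: (n_items, n_factors)
print(algo.bu.shape, algo.bi.shape)   # learned user and item biases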
## SAVING TRAINED MODEL
# from surprise import dump
# import os
# model_filename = "./model.pickle"
# print (">> Starting dump")
# # Dump algorithm and reload it.
# file_name = os.path.expanduser(model_filename)
# dump.dump(file_name, algo=algo)
# print (">> Dump done")
# print(model_filename)
# ## LOAD SAVED MODEL
# def load_model(model_filename):
# print (">> Loading dump")
# from surprise import dump
# import os
# file_name = os.path.expanduser(model_filename)
# _, loaded_model = dump.load(file_name)
# print (">> Loaded dump")
# return loaded_model
from pprint import pprint as pp
# model_filename = "./model.pickle"
def itemRating(user, item):
    uid = str(user)
    iid = str(item)
    # loaded_model = load_model(model_filename)
    prediction = algo.predict(user, item, verbose=True)
    rating = prediction.est
    details = prediction.details
    uid = prediction.uid
    iid = prediction.iid
    true = prediction.r_ui
    ret = {
        'user': user,
        'item': item,
        'rating': rating,
        'details': details,
        'uid': uid,
        'iid': iid,
        'true': true
    }
    pp(ret)
    print('\n\n')
    return ret
print(itemRating(user = "610", item = "10"))
user: 610 item: 10 r_ui = None est = 3.54 {'was_impossible': False}
{'details': {'was_impossible': False},
'iid': '10',
'item': '10',
'rating': 3.543813091304151,
'true': None,
'uid': '610',
'user': '610'}
{'user': '610', 'item': '10', 'rating': 3.543813091304151, 'details': {'was_impossible': False}, 'uid': '610', 'iid': '10', 'true': None}