Matrix Factorization

Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. These algorithms work by decomposing the user-item interaction matrix into the product of two lower-dimensional rectangular matrices: the rows of the first matrix represent the latent user factors, and the columns of the second matrix represent the latent item factors. The product of the two matrices approximates the original user-item interaction matrix, and each individual predicted rating is the dot product of a user factor vector and an item factor vector. The latent factors are also known as embeddings and typically have much lower dimensionality than the original user and item vectors. They are learned through an iterative process that minimizes the error between the reconstructed matrix and the observed interactions, where the error is measured by a loss function such as mean squared error (MSE) or binary cross entropy (BCE) and minimized with gradient descent or one of its variants.
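
To make the idea concrete, here is a minimal sketch of this learning loop in plain NumPy, factorizing a tiny toy ratings matrix by gradient descent on the squared error over observed entries. The matrix values, the number of factors k, the learning rate, and the regularization strength are all illustrative assumptions, not values used later in this section.

import numpy as np

# Toy user-item rating matrix; 0 marks an unobserved entry (made-up data).
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

n_users, n_items = R.shape
k = 2        # number of latent factors (assumed)
lr = 0.01    # learning rate (assumed)
reg = 0.02   # L2 regularization strength (assumed)

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))   # latent user factors
Q = rng.normal(scale=0.1, size=(n_items, k))   # latent item factors

observed = R > 0
for epoch in range(5000):
    E = (R - P @ Q.T) * observed     # error on observed entries only
    P += lr * (E @ Q - reg * P)      # gradient step for user factors
    Q += lr * (E.T @ P - reg * Q)    # gradient step for item factors

print(np.round(P @ Q.T, 2))          # reconstruction approximating R

Each predicted rating here is simply the dot product of one row of P with one row of Q, which is also what libraries such as Surprise compute internally (plus bias terms).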

Singular Value Decomposition (SVD)

The singular value decomposition comes from linear algebra and is a way of breaking a matrix down into constituent parts: any matrix can be factorized into the product of three matrices. This is called factorization because it works much like factoring numbers: 15 can be factored into 3 and 5, and multiplying 3 and 5 together gives back 15.

\[ R=P \Sigma Q^{\mathrm{T}} \]
  • \(R\) is \(m \times n\) ratings matrix

  • \(P\) is \(m \times k\) user-feature affinity matrix

  • \(Q\) is \(n \times k\) item-feature relevance matrix

  • \(\Sigma\) is \(k \times k\) diagonal feature weight matrix

  • For linear algebra people: \(P\) and \(Q\) have orthonormal columns

  • Linear algebra guarantees this exists for any real \(R\)
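
As a quick illustration of the decomposition above, here is a sketch with made-up numbers using NumPy's np.linalg.svd (not a recommender library): it computes the three factors and shows that keeping only the top k singular values yields a low-rank approximation of R.

import numpy as np

# Small illustrative "ratings" matrix (values are made up).
R = np.array([
    [5.0, 3.0, 1.0],
    [4.0, 2.0, 1.0],
    [1.0, 1.0, 5.0],
    [1.0, 2.0, 4.0],
])

# Full SVD: R = P @ diag(sigma) @ Qt, where Qt is Q transposed.
P, sigma, Qt = np.linalg.svd(R, full_matrices=False)

# Keep only the top k latent features for a low-rank approximation.
k = 2
R_approx = P[:, :k] @ np.diag(sigma[:k]) @ Qt[:k, :]

print(np.round(sigma, 2))      # feature weights, largest first
print(np.round(R_approx, 2))   # rank-k reconstruction of R

In practice the ratings matrix is mostly missing values, so recommender systems learn P and Q by minimizing error on the observed entries (as in the sketch in the previous section) rather than by computing an exact SVD.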

Latent features

Latent means not directly observable. The common use of the term in PCA and Factor Analysis is to reduce the dimensionality of a large set of directly observable features to a smaller set of indirectly observable features.

  • SVD describes preference in terms of latent features

  • These features are learned from the rating data

  • Not necessarily interpretable

    • Optimized for predictive power

  • Defines a shared vector space for users and items (feature space)

    • Enables compact representation of each

Example using the Surprise library

import pandas as pd
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

Importing data

GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.org). The data sets were collected over various periods of time, depending on the size of the set.

We are using the Small dataset: 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.

Download: ml-latest-small.zip (size: 1 MB)

df = pd.read_csv("https://raw.githubusercontent.com/singhsidhukuldeep/Recommendation-System/master/data/ratings.csv")
df.head()
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931
df.tail()
        userId  movieId  rating   timestamp
100831     610   166534     4.0  1493848402
100832     610   168248     5.0  1493850091
100833     610   168250     5.0  1494273047
100834     610   168252     5.0  1493846352
100835     610   170875     3.0  1493846415
df.drop(['timestamp'], axis=1, inplace=True)
df.columns = ['userID', 'item', 'rating']
df.head()
   userID  item  rating
0       1     1     4.0
1       1     3     4.0
2       1     6     4.0
3       1    47     5.0
4       1    50     5.0
df.shape
(100836, 3)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   userID  100836 non-null  int64  
 1   item    100836 non-null  int64  
 2   rating  100836 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB
print('Dataset shape: {}'.format(df.shape))
print('-Dataset examples-')
print(df.iloc[::20000, :])
Dataset shape: (100836, 3)
-Dataset examples-
        userID  item  rating
0            1     1     4.0
20000      132  1079     3.5
40000      274  5621     2.0
60000      387  6748     3.0
80000      501    11     3.0
100000     610  6978     4.0
# To reduce the dimensionality of the dataset, we will filter out rarely rated movies and rarely rating users.

min_ratings = 5
filter_items = df['item'].value_counts() > min_ratings
filter_items = filter_items[filter_items].index.tolist()

min_user_ratings = 5
filter_users = df['userID'].value_counts() > min_user_ratings
filter_users = filter_users[filter_users].index.tolist()

df_new = df[(df['item'].isin(filter_items)) & (df['userID'].isin(filter_users))]
print('The original data frame shape:\t{}'.format(df.shape))
print('The new data frame shape:\t{}'.format(df_new.shape))
The original data frame shape:	(100836, 3)
The new data frame shape:	(88364, 3)
Surprise library

To load a dataset from a pandas dataframe, we will use the load_from_df() method. We will also need a Reader object, and the rating_scale parameter must be specified. The dataframe must have three columns, corresponding to the user ids, the item ids, and the ratings, in this order. Each row thus corresponds to a given rating.

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df_new[['userID', 'item', 'rating']], reader)
Basic algorithms

With the Surprise library, we will benchmark the following algorithms:

NormalPredictor
  • The NormalPredictor algorithm predicts a random rating based on the distribution of the training set, which is assumed to be normal. This is one of the most basic algorithms and does not do much work.

BaselineOnly
  • The BaselineOnly algorithm predicts the baseline estimate for a given user and item.

k-NN algorithms
KNNBasic
  • KNNBasic is a basic collaborative filtering algorithm.

KNNWithMeans
  • KNNWithMeans is a basic collaborative filtering algorithm, taking into account the mean ratings of each user.

KNNWithZScore
  • KNNWithZScore is a basic collaborative filtering algorithm, taking into account the z-score normalization of each user.

KNNBaseline
  • KNNBaseline is a basic collaborative filtering algorithm taking into account a baseline rating.

Matrix Factorization-based algorithms
SVD
  • The SVD algorithm is equivalent to Probabilistic Matrix Factorization (http://papers.nips.cc/paper/3208-probabilistic-matrix-factorization.pdf).

SVDpp
  • The SVDpp algorithm is an extension of SVD that takes into account implicit ratings.

NMF
  • NMF is a collaborative filtering algorithm based on Non-negative Matrix Factorization. It is very similar to SVD.

Slope One
  • Slope One is a straightforward implementation of the SlopeOne algorithm. (https://arxiv.org/abs/cs/0702144)

Co-clustering
  • Co-clustering is a collaborative filtering algorithm based on co-clustering (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.113.6458&rep=rep1&type=pdf)

We use RMSE as our accuracy metric for the predictions.

benchmark = []
# Iterate over all algorithms

# algorithms = [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]
algorithms = [SVD(), KNNWithMeans(), CoClustering()]
# print ("Attempting: ", str(algorithms), '\n\n\n')

for algorithm in algorithms:
    # print("Starting: " ,str(algorithm))
    print("Starting: ",str(algorithm).split(' ')[0].split('.')[-1])
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=True)
    # results = cross_validate(algorithm, data, measures=['RMSE','MAE'], cv=3, verbose=False)
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    benchmark.append(tmp)
    print("Done: " ,str(algorithm), "\n\n")

print ('\n\tDONE\n')
Starting:  SVD
Evaluating RMSE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8677  0.8615  0.8622  0.8638  0.0028  
Fit time          0.59    0.67    0.64    0.63    0.03    
Test time         0.43    0.19    0.19    0.27    0.11    
Done:  <surprise.prediction_algorithms.matrix_factorization.SVD object at 0x7f3277689250> 


Starting:  KNNWithMeans
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNWithMeans on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8709  0.8735  0.8615  0.8686  0.0052  
Fit time          0.07    0.09    0.08    0.08    0.01    
Test time         1.41    1.44    1.44    1.43    0.02    
Done:  <surprise.prediction_algorithms.knns.KNNWithMeans object at 0x7f32953be090> 


Starting:  CoClustering
Evaluating RMSE of algorithm CoClustering on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9203  0.9108  0.9216  0.9176  0.0048  
Fit time          1.26    1.37    1.27    1.30    0.05    
Test time         0.12    0.12    0.13    0.12    0.00    
Done:  <surprise.prediction_algorithms.co_clustering.CoClustering object at 0x7f3277688f90> 



	DONE
surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')
surprise_results
              test_rmse  fit_time  test_time
Algorithm
SVD            0.863810  0.632358   0.269530
KNNWithMeans   0.868612  0.081368   1.427201
CoClustering   0.917562  1.301500   0.124395

SVDpp performs best in the full benchmark (the commented-out list above), but it takes a lot of time, so we will use SVD instead and tune it with GridSearchCV.

# param_grid = {
#     "n_epochs": [5, 10, 15, 20, 30, 40, 50, 100],
#     "lr_all": [0.001, 0.002, 0.005],
#     "reg_all": [0.02, 0.08, 0.4, 0.6]
# }

# smaller grid for testing
param_grid = {
    "n_epochs": [10, 20],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.02]
}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], refit=True, cv=5)

gs.fit(data)

training_parameters = gs.best_params["rmse"]

print("BEST RMSE: \t", gs.best_score["rmse"])
print("BEST MAE: \t", gs.best_score["mae"])
print("BEST params: \t", gs.best_params["rmse"])
BEST RMSE: 	 0.8558621760774761
BEST MAE: 	 0.6563747332654953
BEST params: 	 {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}
from datetime import datetime
print(training_parameters)
reader = Reader(rating_scale=(1, 5))

print("\n\n\t\t STARTING\n\n")
start = datetime.now()

print("> Loading data...")
data = Dataset.load_from_df(df_new[['userID', 'item', 'rating']], reader)
print("> OK")

print("> Creating trainset...")
trainset = data.build_full_trainset()
print("> OK")


startTraining = datetime.now()
print("> Training...")

algo = SVD(n_epochs = training_parameters['n_epochs'], lr_all = training_parameters['lr_all'], reg_all = training_parameters['reg_all'])

algo.fit(trainset)

endTraining = datetime.now()
print("> OK \t\t It Took: ", (endTraining-startTraining).seconds, "seconds")

end = datetime.now()
print (">> DONE \t\t It Took", (end-start).seconds, "seconds" )
{'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}


		 STARTING


> Loading data...
> OK
> Creating trainset...
> OK
> Training...
> OK 		 It Took:  0 seconds
>> DONE 		 It Took 0 seconds
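
To tie this back to the latent features discussed earlier: after fitting, Surprise's SVD model exposes the learned factor matrices as algo.pu (user factors) and algo.qi (item factors), plus bias terms algo.bu and algo.bi. The sketch below assumes the algo and trainset objects from the training code above, and uses user 610 and item 10 from the dataset as an example lookup.

import numpy as np

print(algo.pu.shape)   # (n_users, n_factors): one latent vector per user
print(algo.qi.shape)   # (n_items, n_factors): one latent vector per item

# Map raw ids (as they appear in the dataframe) to Surprise's inner ids.
inner_uid = trainset.to_inner_uid(610)
inner_iid = trainset.to_inner_iid(10)

# A predicted rating is the global mean plus the user and item biases
# plus the dot product of the user and item latent vectors.
est = (trainset.global_mean
       + algo.bu[inner_uid]
       + algo.bi[inner_iid]
       + np.dot(algo.pu[inner_uid], algo.qi[inner_iid]))
print(est)
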
## SAVING TRAINED MODEL
# from surprise import dump
# import os
# model_filename = "./model.pickle"
# print (">> Starting dump")
# # Dump algorithm and reload it.
# file_name = os.path.expanduser(model_filename)
# dump.dump(file_name, algo=algo)
# print (">> Dump done")
# print(model_filename)
# ## LOAD SAVED MODEL
# def load_model(model_filename):
#     print (">> Loading dump")
#     from surprise import dump
#     import os
#     file_name = os.path.expanduser(model_filename)
#     _, loaded_model = dump.load(file_name)
#     print (">> Loaded dump")
#     return loaded_model
from pprint import pprint as pp
# model_filename = "./model.pickle"
def itemRating(user, item):
    uid = str(user)
    iid = str(item) 
    # loaded_model = load_model(model_filename)
    prediction = algo.predict(user, item, verbose=True)
    rating = prediction.est
    details = prediction.details
    uid = prediction.uid
    iid = prediction.iid
    true = prediction.r_ui
    ret = {
        'user': user, 
        'item': item, 
        'rating': rating, 
        'details': details,
        'uid': uid,
        'iid': iid,
        'true': true
        }
    pp (ret)
    print ('\n\n')
    return ret
print(itemRating(user = "610", item = "10"))
user: 610        item: 10         r_ui = None   est = 3.54   {'was_impossible': False}
{'details': {'was_impossible': False},
 'iid': '10',
 'item': '10',
 'rating': 3.543813091304151,
 'true': None,
 'uid': '610',
 'user': '610'}



{'user': '610', 'item': '10', 'rating': 3.543813091304151, 'details': {'was_impossible': False}, 'uid': '610', 'iid': '10', 'true': None}