Matrix Factorization

Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. These algorithms work by decomposing the user-item interaction matrix into the product of two lower-dimensional rectangular matrices: the rows of the first matrix represent the latent user factors, and the columns of the second matrix represent the latent item factors. The product of the two matrices approximates the original user-item interaction matrix, and each individual predicted rating is the dot product of a user factor vector and an item factor vector. The latent factors are also known as embeddings and typically have much lower dimensionality than the original user and item vectors. They are learned through an iterative process that minimizes the error between the reconstructed matrix and the observed interactions, where the error is measured by a loss function such as mean squared error (MSE) or binary cross entropy (BCE) and minimized with gradient descent or one of its variants.
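
To make the idea concrete, here is a minimal sketch of this learning loop in plain NumPy, factorizing a tiny toy ratings matrix by gradient descent on the squared error over observed entries. The matrix values, the number of factors k, the learning rate, and the regularization strength are all illustrative assumptions, not values used later in this section.

import numpy as np

# Toy user-item rating matrix; 0 marks an unobserved entry (made-up data).
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

n_users, n_items = R.shape
k = 2        # number of latent factors (assumed)
lr = 0.01    # learning rate (assumed)
reg = 0.02   # L2 regularization strength (assumed)

rng = np.random.default_rng(0)
P = rng.normal(scale=0.1, size=(n_users, k))   # latent user factors
Q = rng.normal(scale=0.1, size=(n_items, k))   # latent item factors

observed = R > 0
for epoch in range(5000):
    E = (R - P @ Q.T) * observed     # error on observed entries only
    P += lr * (E @ Q - reg * P)      # gradient step for user factors
    Q += lr * (E.T @ P - reg * Q)    # gradient step for item factors

print(np.round(P @ Q.T, 2))          # reconstruction approximating R

Each predicted rating here is simply the dot product of one row of P with one row of Q, which is also what libraries such as Surprise compute internally (plus bias terms).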

Singular Value Decomposition (SVD)

The singular value decomposition comes from linear algebra and is a way of breaking a matrix down into constituent parts: any matrix can be factorized into the product of three matrices. This is called factorization because it works much like factoring numbers: 15 can be factored into 3 and 5, and multiplying 3 and 5 together gives back 15.

\[ R=P \Sigma Q^{\mathrm{T}} \]
  • \(R\) is \(m \times n\) ratings matrix

  • \(P\) is \(m \times k\) user-feature affinity matrix

  • \(Q\) is \(n \times k\) item-feature relevance matrix

  • \(\Sigma\) is \(k \times k\) diagonal feature weight matrix

  • For linear algebra people: \(P\) and \(Q\) have orthonormal columns

  • Linear algebra guarantees this exists for any real \(R\)
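
As a quick illustration of the decomposition above, here is a sketch with made-up numbers using NumPy's np.linalg.svd (not a recommender library): it computes the three factors and shows that keeping only the top k singular values yields a low-rank approximation of R.

import numpy as np

# Small illustrative "ratings" matrix (values are made up).
R = np.array([
    [5.0, 3.0, 1.0],
    [4.0, 2.0, 1.0],
    [1.0, 1.0, 5.0],
    [1.0, 2.0, 4.0],
])

# Full SVD: R = P @ diag(sigma) @ Qt, where Qt is Q transposed.
P, sigma, Qt = np.linalg.svd(R, full_matrices=False)

# Keep only the top k latent features for a low-rank approximation.
k = 2
R_approx = P[:, :k] @ np.diag(sigma[:k]) @ Qt[:k, :]

print(np.round(sigma, 2))      # feature weights, largest first
print(np.round(R_approx, 2))   # rank-k reconstruction of R

In practice the ratings matrix is mostly missing values, so recommender systems learn P and Q by minimizing error on the observed entries (as in the sketch in the previous section) rather than by computing an exact SVD.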

Latent features

Latent means not directly observable. The common use of the term in PCA and Factor Analysis is to reduce the dimensionality of a large set of directly observable features to a smaller set of indirectly observable features.

  • SVD describes preference in terms of latent features

  • These features are learned from the rating data

  • Not necessarily interpretable

    • Optimized for predictive power

  • Defines a shared vector space for users and items (feature space)

    • Enables compact representation of each

Example using the Surprise library

import pandas as pd
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV

Importing data

GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.org). The data sets were collected over various periods of time, depending on the size of the set.

We are using the Small dataset: 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.

Download: ml-latest-small.zip (size: 1 MB)

df = pd.read_csv("https://raw.githubusercontent.com/singhsidhukuldeep/Recommendation-System/master/data/ratings.csv")
df.head()
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931
df.tail()
        userId  movieId  rating   timestamp
100831     610   166534     4.0  1493848402
100832     610   168248     5.0  1493850091
100833     610   168250     5.0  1494273047
100834     610   168252     5.0  1493846352
100835     610   170875     3.0  1493846415
df.drop(['timestamp'], axis=1, inplace=True)
df.columns = ['userID', 'item', 'rating']
df.head()
   userID  item  rating
0       1     1     4.0
1       1     3     4.0
2       1     6     4.0
3       1    47     5.0
4       1    50     5.0
df.shape
(100836, 3)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   userID  100836 non-null  int64  
 1   item    100836 non-null  int64  
 2   rating  100836 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB
print('Dataset shape: {}'.format(df.shape))
print('-Dataset examples-')
print(df.iloc[::20000, :])
Dataset shape: (100836, 3)
-Dataset examples-
        userID  item  rating
0            1     1     4.0
20000      132  1079     3.5
40000      274  5621     2.0
60000      387  6748     3.0
80000      501    11     3.0
100000     610  6978     4.0
# To reduce the dimensionality of the dataset, we will filter out rarely rated movies and rarely rating users.

min_ratings = 5
filter_items = df['item'].value_counts() > min_ratings
filter_items = filter_items[filter_items].index.tolist()

min_user_ratings = 5
filter_users = df['userID'].value_counts() > min_user_ratings
filter_users = filter_users[filter_users].index.tolist()

df_new = df[(df['item'].isin(filter_items)) & (df['userID'].isin(filter_users))]
print('The original data frame shape:\t{}'.format(df.shape))
print('The new data frame shape:\t{}'.format(df_new.shape))
The original data frame shape:	(100836, 3)
The new data frame shape:	(88364, 3)
Surprise library

To load a dataset from a pandas dataframe, we will use the load_from_df() method. We will also need a Reader object, and the rating_scale parameter must be specified. The dataframe must have three columns, corresponding to the user ids, the item ids, and the ratings, in this order. Each row thus corresponds to a given rating.

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df_new[['userID', 'item', 'rating']], reader)
Basic algorithms

With the Surprise library, we will benchmark the following algorithms:

NormalPredictor
  • The NormalPredictor algorithm predicts a random rating based on the distribution of the training set, which is assumed to be normal. This is one of the most basic algorithms and does not do much work.

BaselineOnly
  • The BaselineOnly algorithm predicts the baseline estimate for a given user and item.

k-NN algorithms
KNNBasic
  • KNNBasic is a basic collaborative filtering algorithm.

KNNWithMeans
  • KNNWithMeans is a basic collaborative filtering algorithm, taking into account the mean ratings of each user.

KNNWithZScore
  • KNNWithZScore is a basic collaborative filtering algorithm, taking into account the z-score normalization of each user.

KNNBaseline
  • KNNBaseline is a basic collaborative filtering algorithm taking into account a baseline rating.

Matrix Factorization-based algorithms
SVD
  • The SVD algorithm is equivalent to Probabilistic Matrix Factorization (http://papers.nips.cc/paper/3208-probabilistic-matrix-factorization.pdf).

SVDpp
  • The SVDpp algorithm is an extension of SVD that takes into account implicit ratings.

NMF
  • NMF is a collaborative filtering algorithm based on Non-negative Matrix Factorization. It is very similar to SVD.

Slope One
  • Slope One is a straightforward implementation of the SlopeOne algorithm. (https://arxiv.org/abs/cs/0702144)

Co-clustering
  • Co-clustering is a collaborative filtering algorithm based on co-clustering (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.113.6458&rep=rep1&type=pdf)

We use RMSE as our accuracy metric for the predictions.

benchmark = []
# Iterate over all algorithms

# algorithms = [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]
algorithms = [SVD(), KNNWithMeans(), CoClustering()]
# print ("Attempting: ", str(algorithms), '\n\n\n')

for algorithm in algorithms:
    # print("Starting: " ,str(algorithm))
    print("Starting: ",str(algorithm).split(' ')[0].split('.')[-1])
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=True)
    # results = cross_validate(algorithm, data, measures=['RMSE','MAE'], cv=3, verbose=False)
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    benchmark.append(tmp)
    print("Done: " ,str(algorithm), "\n\n")

print ('\n\tDONE\n')
Starting:  SVD
Evaluating RMSE of algorithm SVD on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8677  0.8615  0.8622  0.8638  0.0028  
Fit time          0.59    0.67    0.64    0.63    0.03    
Test time         0.43    0.19    0.19    0.27    0.11    
Done:  <surprise.prediction_algorithms.matrix_factorization.SVD object at 0x7f3277689250> 


Starting:  KNNWithMeans
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Evaluating RMSE of algorithm KNNWithMeans on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.8709  0.8735  0.8615  0.8686  0.0052  
Fit time          0.07    0.09    0.08    0.08    0.01    
Test time         1.41    1.44    1.44    1.43    0.02    
Done:  <surprise.prediction_algorithms.knns.KNNWithMeans object at 0x7f32953be090> 


Starting:  CoClustering
Evaluating RMSE of algorithm CoClustering on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.9203  0.9108  0.9216  0.9176  0.0048  
Fit time          1.26    1.37    1.27    1.30    0.05    
Test time         0.12    0.12    0.13    0.12    0.00    
Done:  <surprise.prediction_algorithms.co_clustering.CoClustering object at 0x7f3277688f90> 



	DONE
surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')
surprise_results
              test_rmse  fit_time  test_time
Algorithm
SVD            0.863810  0.632358   0.269530
KNNWithMeans   0.868612  0.081368   1.427201
CoClustering   0.917562  1.301500   0.124395

SVDpp performs best in the full benchmark (the commented-out list above), but it takes a lot of time, so we will use SVD instead and tune it with GridSearchCV.

# param_grid = {
#     "n_epochs": [5, 10, 15, 20, 30, 40, 50, 100],
#     "lr_all": [0.001, 0.002, 0.005],
#     "reg_all": [0.02, 0.08, 0.4, 0.6]
# }

# smaller grid for testing
param_grid = {
    "n_epochs": [10, 20],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.02]
}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], refit=True, cv=5)

gs.fit(data)

training_parameters = gs.best_params["rmse"]

print("BEST RMSE: \t", gs.best_score["rmse"])
print("BEST MAE: \t", gs.best_score["mae"])
print("BEST params: \t", gs.best_params["rmse"])
BEST RMSE: 	 0.8558621760774761
BEST MAE: 	 0.6563747332654953
BEST params: 	 {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}
from datetime import datetime
print(training_parameters)
reader = Reader(rating_scale=(1, 5))

print("\n\n\t\t STARTING\n\n")
start = datetime.now()

print("> Loading data...")
data = Dataset.load_from_df(df_new[['userID', 'item', 'rating']], reader)
print("> OK")

print("> Creating trainset...")
trainset = data.build_full_trainset()
print("> OK")


startTraining = datetime.now()
print("> Training...")

algo = SVD(n_epochs = training_parameters['n_epochs'], lr_all = training_parameters['lr_all'], reg_all = training_parameters['reg_all'])

algo.fit(trainset)

endTraining = datetime.now()
print("> OK \t\t It Took: ", (endTraining-startTraining).seconds, "seconds")

end = datetime.now()
print (">> DONE \t\t It Took", (end-start).seconds, "seconds" )
{'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}


		 STARTING


> Loading data...
> OK
> Creating trainset...
> OK
> Training...
> OK 		 It Took:  0 seconds
>> DONE 		 It Took 0 seconds
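
To tie this back to the latent features discussed earlier: after fitting, Surprise's SVD model exposes the learned factor matrices as algo.pu (user factors) and algo.qi (item factors), plus bias terms algo.bu and algo.bi. The sketch below assumes the algo and trainset objects from the training code above, and uses user 610 and item 10 from the dataset as an example lookup.

import numpy as np

print(algo.pu.shape)   # (n_users, n_factors): one latent vector per user
print(algo.qi.shape)   # (n_items, n_factors): one latent vector per item

# Map raw ids (as they appear in the dataframe) to Surprise's inner ids.
inner_uid = trainset.to_inner_uid(610)
inner_iid = trainset.to_inner_iid(10)

# A predicted rating is the global mean plus the user and item biases
# plus the dot product of the user and item latent vectors.
est = (trainset.global_mean
       + algo.bu[inner_uid]
       + algo.bi[inner_iid]
       + np.dot(algo.pu[inner_uid], algo.qi[inner_iid]))
print(est)
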
## SAVING TRAINED MODEL
# from surprise import dump
# import os
# model_filename = "./model.pickle"
# print (">> Starting dump")
# # Dump algorithm and reload it.
# file_name = os.path.expanduser(model_filename)
# dump.dump(file_name, algo=algo)
# print (">> Dump done")
# print(model_filename)
# ## LOAD SAVED MODEL
# def load_model(model_filename):
#     print (">> Loading dump")
#     from surprise import dump
#     import os
#     file_name = os.path.expanduser(model_filename)
#     _, loaded_model = dump.load(file_name)
#     print (">> Loaded dump")
#     return loaded_model
from pprint import pprint as pp
# model_filename = "./model.pickle"
def itemRating(user, item):
    uid = str(user)
    iid = str(item) 
    # loaded_model = load_model(model_filename)
    prediction = algo.predict(user, item, verbose=True)
    rating = prediction.est
    details = prediction.details
    uid = prediction.uid
    iid = prediction.iid
    true = prediction.r_ui
    ret = {
        'user': user, 
        'item': item, 
        'rating': rating, 
        'details': details,
        'uid': uid,
        'iid': iid,
        'true': true
        }
    pp (ret)
    print ('\n\n')
    return ret
print(itemRating(user = "610", item = "10"))
user: 610        item: 10         r_ui = None   est = 3.54   {'was_impossible': False}
{'details': {'was_impossible': False},
 'iid': '10',
 'item': '10',
 'rating': 3.543813091304151,
 'true': None,
 'uid': '610',
 'user': '610'}



{'user': '610', 'item': '10', 'rating': 3.543813091304151, 'details': {'was_impossible': False}, 'uid': '610', 'iid': '10', 'true': None}