# Matrix Factorization

Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. These algorithms decompose the user-item interaction matrix into the product of two lower-dimensional rectangular matrices. The rows of the first matrix hold the latent user factors and the columns of the second matrix hold the latent item factors, so the product of the two matrices approximates the original user-item interaction matrix. The latent factors are also known as embeddings and typically have far fewer dimensions than the original user and item vectors.

The latent factors are learned through an iterative process that minimizes the error between the predicted interactions (the dot product of a user's and an item's latent factors) and the observed entries of the user-item interaction matrix. The error is measured with a loss function such as mean squared error (MSE) or binary cross entropy (BCE), and the loss is minimized using gradient descent or one of its variants.
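The learning loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of any particular library: the function name `factorize`, the toy ratings, and the hyperparameter values are all assumptions chosen for the example. It uses stochastic gradient descent on a squared-error loss with a small L2 penalty.

```python
import random

def factorize(ratings, n_users, n_items, k=2, lr=0.01, reg=0.02, n_epochs=500, seed=0):
    """Learn user factors P and item factors Q from (user, item, rating) triples."""
    rng = random.Random(seed)
    # Small random initial factors; one row per user / item, k columns each.
    P = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    history = []  # mean squared error measured during each epoch
    for _ in range(n_epochs):
        squared_error = 0.0
        for u, i, r in ratings:
            pred = sum(P[u][f] * Q[i][f] for f in range(k))  # dot product
            err = r - pred
            squared_error += err * err
            for f in range(k):
                p_uf, q_if = P[u][f], Q[i][f]
                # Gradient step on the squared error with L2 regularization.
                P[u][f] += lr * (err * q_if - reg * p_uf)
                Q[i][f] += lr * (err * p_uf - reg * q_if)
        history.append(squared_error / len(ratings))
    return P, Q, history

# Toy data (hypothetical): (user index, item index, rating).
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
P, Q, history = factorize(ratings, n_users=3, n_items=3)
print(f"MSE in first epoch: {history[0]:.3f}, in last epoch: {history[-1]:.3f}")
```

Only the observed ratings enter the loss, which is what lets this approach cope with a sparse interaction matrix.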

## Singular Value Decomposition (SVD)

Singular value decomposition comes from linear algebra and is a way of breaking a matrix down into constituent parts: we can factorize it into three matrices. It is called factorization because it works much like factoring numbers. You can factorize 15 into 3 and 5, and multiplying 3 and 5 together gives you 15 again.

$$R=P \Sigma Q^{\mathrm{T}}$$

• $$R$$ is the $$m \times n$$ ratings matrix

• $$P$$ is $$m \times k$$ user-feature affinity matrix

• $$Q$$ is $$n \times k$$ item-feature relevance matrix

• $$\Sigma$$ is $$k \times k$$ diagonal feature weight matrix

• For linear algebra people: the columns of $$P$$ and $$Q$$ are orthonormal

• Linear algebra guarantees this exists for any real $$R$$
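Written out entry by entry, the factorization says that each rating is a weighted sum over the $$k$$ features. The index names $u$ (user), $i$ (item) and $f$ (feature) are assumptions introduced here, and the numbers in the worked line are made up purely to show the arithmetic for $k = 2$:

```latex
r_{ui} = \sum_{f=1}^{k} p_{uf}\, \sigma_f\, q_{if}

% Illustrative values: p_u = (0.5, -0.2), \sigma = (10, 4), q_i = (0.6, 0.1)
r_{ui} = (0.5)(10)(0.6) + (-0.2)(4)(0.1) = 3.0 - 0.08 = 2.92
```

Each term is the user's affinity for a feature, times the weight of that feature, times the item's relevance to it.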

### Latent features

Latent means not directly observable. In PCA and factor analysis, the term commonly refers to reducing a large number of directly observable features to a smaller set of indirectly observable (latent) features.

• SVD describes preference in terms of latent features

• These features are learned from the rating data

• Not necessarily interpretable

• Optimized for predictive power

• Defines a shared vector space for users and items (feature space)

• Enables compact representation of each
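To make the idea of a shared feature space concrete, here is a small sketch in which one user and three items are each described by two latent features. All vectors and item names are hypothetical. Because users and items live in the same space, a dot product gives a preference score that can be used to rank items.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 2-dimensional latent vectors (values are illustrative only).
user_vector = [0.9, 0.1]
item_vectors = {
    "item_a": [0.8, 0.2],
    "item_b": [0.1, 0.9],
    "item_c": [0.5, 0.5],
}

# Score every item for this user and rank from highest to lowest score.
scores = {item: dot(user_vector, vec) for item, vec in item_vectors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The user leans heavily on the first feature, so the item that is strongest on that feature ranks first.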

### Example using Surprise library

import pandas as pd
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering
from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import train_test_split
from surprise.model_selection import GridSearchCV


#### Importing data

GroupLens Research has collected and made available rating data sets from the MovieLens web site (http://movielens.org). The data sets were collected over various periods of time, depending on the size of the set.

We are using the small dataset: 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users. Last updated 9/2018.

df = pd.read_csv("https://raw.githubusercontent.com/singhsidhukuldeep/Recommendation-System/master/data/ratings.csv")

df.head()

userId movieId rating timestamp
0 1 1 4.0 964982703
1 1 3 4.0 964981247
2 1 6 4.0 964982224
3 1 47 5.0 964983815
4 1 50 5.0 964982931
df.tail()

userId movieId rating timestamp
100831 610 166534 4.0 1493848402
100832 610 168248 5.0 1493850091
100833 610 168250 5.0 1494273047
100834 610 168252 5.0 1493846352
100835 610 170875 3.0 1493846415
df.drop(['timestamp'], axis=1, inplace=True)
df.columns = ['userID', 'item', 'rating']

df.head()

userID item rating
0 1 1 4.0
1 1 3 4.0
2 1 6 4.0
3 1 47 5.0
4 1 50 5.0
df.shape

(100836, 3)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 3 columns):
#   Column  Non-Null Count   Dtype
---  ------  --------------   -----
0   userID  100836 non-null  int64
1   item    100836 non-null  int64
2   rating  100836 non-null  float64
dtypes: float64(1), int64(2)
memory usage: 2.3 MB

print('Dataset shape: {}'.format(df.shape))
print('-Dataset examples-')
print(df.iloc[::20000, :])

Dataset shape: (100836, 3)
-Dataset examples-
userID  item  rating
0            1     1     4.0
20000      132  1079     3.5
40000      274  5621     2.0
60000      387  6748     3.0
80000      501    11     3.0
100000     610  6978     4.0

To reduce the dimensionality of the dataset, we filter out rarely rated movies and users who rarely rate, keeping only movies and users with more than 5 ratings.

min_ratings = 5
filter_items = df['item'].value_counts() > min_ratings
filter_items = filter_items[filter_items].index.tolist()

min_user_ratings = 5
filter_users = df['userID'].value_counts() > min_user_ratings
filter_users = filter_users[filter_users].index.tolist()

df_new = df[(df['item'].isin(filter_items)) & (df['userID'].isin(filter_users))]
print('The original data frame shape:\t{}'.format(df.shape))
print('The new data frame shape:\t{}'.format(df_new.shape))

The original data frame shape:	(100836, 3)
The new data frame shape:	(88364, 3)

##### Surprise library

To load a dataset from a pandas dataframe, we use the load_from_df() method, which also needs a Reader object with the rating_scale parameter specified. The dataframe must have three columns, corresponding to the user ids, the item ids, and the ratings, in this order. Each row thus corresponds to a given rating.

reader = Reader(rating_scale=(1, 5))
# Build the Surprise dataset used below (assumes the filtered frame df_new is the one intended)
data = Dataset.load_from_df(df_new[['userID', 'item', 'rating']], reader)

##### Basic algorithms

With the Surprise library, we will benchmark the following algorithms:

##### NormalPredictor
• The NormalPredictor algorithm predicts a random rating based on the distribution of the training set, which is assumed to be normal. It is one of the most basic algorithms and does very little work.
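As a rough sketch of what such a random predictor does, the snippet below estimates a mean and standard deviation from some training ratings, draws from the corresponding normal distribution, and clips the result to the rating scale. The function names and the toy ratings are assumptions for illustration. This is not Surprise's own code.

```python
import random
import statistics

def fit_normal(train_ratings):
    """Estimate the mean and (population) standard deviation of the ratings."""
    mu = statistics.fmean(train_ratings)
    sigma = statistics.pstdev(train_ratings)
    return mu, sigma

def predict_random(mu, sigma, rating_scale=(1, 5), rng=random):
    """Draw a rating from N(mu, sigma^2) and clip it to the rating scale."""
    low, high = rating_scale
    return min(high, max(low, rng.gauss(mu, sigma)))

# Hypothetical training ratings.
train_ratings = [4.0, 4.0, 5.0, 3.5, 2.0, 3.0]
mu, sigma = fit_normal(train_ratings)

rng = random.Random(42)  # seeded so the draws are repeatable
samples = [predict_random(mu, sigma, rng=rng) for _ in range(5)]
print(f"mu={mu:.3f}, sigma={sigma:.3f}")
print(samples)
```

Because the prediction ignores both the user and the item, it mainly serves as a floor against which the other algorithms can be compared.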

##### BaselineOnly
• The BaselineOnly algorithm predicts the baseline estimate for a given user and item.
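In the Surprise documentation the baseline estimate is written as the global mean plus a user bias and an item bias. The symbols below follow that convention, and the numbers in the worked line are invented purely for illustration:

```latex
b_{ui} = \mu + b_u + b_i

% Illustrative values: \mu = 3.5,\ b_u = 0.3,\ b_i = -0.5
b_{ui} = 3.5 + 0.3 - 0.5 = 3.3
```

Here a user who rates slightly above average ($b_u > 0$) meets an item that is rated below average ($b_i < 0$), and the two offsets partly cancel.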

##### KNNBasic
• KNNBasic is a basic collaborative filtering algorithm.

##### KNNWithMeans
• KNNWithMeans is a basic collaborative filtering algorithm, taking into account the mean ratings of each user.
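The effect of taking user means into account can be sketched as a similarity-weighted average of mean-centred neighbour ratings, added back onto the target user's own mean. The helper name `knn_with_means` and the neighbour values are hypothetical. A real implementation would also have to compute the similarities and select the neighbours.

```python
def knn_with_means(user_mean, neighbours):
    """Estimate a rating from (similarity, neighbour_rating, neighbour_mean) triples."""
    numerator = sum(sim * (rating - mean) for sim, rating, mean in neighbours)
    denominator = sum(sim for sim, _, _ in neighbours)
    if denominator == 0:
        # No usable neighbours: fall back to the user's own mean.
        return user_mean
    return user_mean + numerator / denominator

# Hypothetical neighbours of the target user who rated the item.
neighbours = [
    (0.8, 5.0, 4.0),  # rated the item one point above their own mean
    (0.2, 2.0, 3.0),  # rated the item one point below their own mean
]
estimate = knn_with_means(user_mean=3.0, neighbours=neighbours)
print(estimate)
```

Centring on each neighbour's mean keeps a habitually generous rater from inflating the estimate.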

##### KNNWithZScore
• KNNWithZScore is a basic collaborative filtering algorithm, taking into account the z-score normalization of each user.

##### KNNBaseline
• KNNBaseline is a basic collaborative filtering algorithm taking into account a baseline rating.

##### SVD
• The SVD algorithm, when baselines are not used, is equivalent to Probabilistic Matrix Factorization (http://papers.nips.cc/paper/3208-probabilistic-matrix-factorization.pdf)
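For reference, the Surprise documentation describes the prediction rule, and the regularized squared error that this algorithm minimizes by stochastic gradient descent, roughly as follows. Here $\mu$ is the global mean, $b_u$ and $b_i$ are user and item biases, $p_u$ and $q_i$ are the latent factor vectors, and $\lambda$ is the regularization strength (it corresponds to `reg_all` when a single value is shared by all terms):

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\mathrm{T}} p_u

\min \sum_{r_{ui} \in R_{\text{train}}} \left( r_{ui} - \hat{r}_{ui} \right)^2
      + \lambda \left( b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2 \right)
```

Dropping the bias terms leaves only the factor product $q_i^{\mathrm{T}} p_u$.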

##### SVDpp
• The SVDpp algorithm is an extension of SVD that takes into account implicit ratings.

##### NMF
• NMF is a collaborative filtering algorithm based on Non-negative Matrix Factorization. It is very similar to SVD.

##### Slope One
• Slope One is a straightforward implementation of the SlopeOne algorithm. (https://arxiv.org/abs/cs/0702144)

##### Co-clustering
• CoClustering is a collaborative filtering algorithm based on co-clustering.

We use RMSE (root mean squared error) as our accuracy metric for the predictions.
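As a reminder of what this metric measures, here is a small stand-alone sketch that computes it from pairs of true and estimated ratings. The function name and the sample pairs are assumptions for illustration. The library's own accuracy helpers are what the benchmark actually uses.

```python
import math

def root_mean_squared_error(pairs):
    """pairs: iterable of (true_rating, estimated_rating)."""
    pairs = list(pairs)
    mse = sum((true - est) ** 2 for true, est in pairs) / len(pairs)
    return math.sqrt(mse)

# Hypothetical (true, estimated) ratings.
pairs = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0)]
score = root_mean_squared_error(pairs)
print(round(score, 4))
```

Squaring the errors before averaging means that one large miss costs more than several small ones, and lower values are better.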

benchmark = []

# Full list of algorithms (slow to run):
# algorithms = [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]
algorithms = [SVD(), KNNWithMeans(), CoClustering()]

# Iterate over all algorithms
for algorithm in algorithms:
    # Class name of the algorithm, e.g. "SVD"
    name = type(algorithm).__name__
    print("Starting: ", name)
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=True)
    # results = cross_validate(algorithm, data, measures=['RMSE','MAE'], cv=3, verbose=False)
    # Average the per-fold results and append the algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([name], index=['Algorithm'])])
    benchmark.append(tmp)
    print("Done: ", str(algorithm), "\n\n")

print('\n\tDONE\n')

Starting:  SVD

Evaluating RMSE of algorithm SVD on 3 split(s).

Fold 1  Fold 2  Fold 3  Mean    Std
RMSE (testset)    0.8575  0.8708  0.8664  0.8649  0.0055
Fit time          0.64    0.65    0.63    0.64    0.01
Test time         0.19    0.19    0.19    0.19    0.00
Done:  <surprise.prediction_algorithms.matrix_factorization.SVD object at 0x7f287e4e78e0>

Starting:  KNNWithMeans
Computing the msd similarity matrix...
Done computing similarity matrix.

Computing the msd similarity matrix...
Done computing similarity matrix.

Computing the msd similarity matrix...
Done computing similarity matrix.

Evaluating RMSE of algorithm KNNWithMeans on 3 split(s).

Fold 1  Fold 2  Fold 3  Mean    Std
RMSE (testset)    0.8616  0.8734  0.8736  0.8695  0.0056
Fit time          0.07    0.09    0.08    0.08    0.01
Test time         1.66    1.66    1.66    1.66    0.00
Done:  <surprise.prediction_algorithms.knns.KNNWithMeans object at 0x7f287e4e63b0>

Starting:  CoClustering

Evaluating RMSE of algorithm CoClustering on 3 split(s).

Fold 1  Fold 2  Fold 3  Mean    Std
RMSE (testset)    0.9178  0.9188  0.9192  0.9186  0.0006
Fit time          1.06    1.08    1.09    1.08    0.01
Test time         0.21    0.14    0.21    0.19    0.03
Done:  <surprise.prediction_algorithms.co_clustering.CoClustering object at 0x7f287e4e6320>

DONE

surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')

surprise_results

| Algorithm | test_rmse | fit_time | test_time |
|---|---|---|---|
| SVD | 0.864887 | 0.639061 | 0.187395 |
| KNNWithMeans | 0.869510 | 0.081105 | 1.659812 |
| CoClustering | 0.918592 | 1.075265 | 0.187301 |

SVDpp performs best (it is not part of the shortened run above), but it takes a long time to train, so we will use SVD instead and tune it with GridSearchCV.

# param_grid = {
#     "n_epochs": [5, 10, 15, 20, 30, 40, 50, 100],
#     "lr_all": [0.001, 0.002, 0.005],
#     "reg_all": [0.02, 0.08, 0.4, 0.6]
# }

# smaller grid for testing
param_grid = {
    "n_epochs": [10, 20],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.02]
}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], refit=True, cv=5)

gs.fit(data)

training_parameters = gs.best_params["rmse"]

print("BEST RMSE: \t", gs.best_score["rmse"])
print("BEST MAE: \t", gs.best_score["mae"])
print("BEST params: \t", gs.best_params["rmse"])

BEST RMSE: 	 0.855942316679972
BEST MAE: 	 0.6564051163054095
BEST params: 	 {'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}
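A grid search of this kind tries every combination of the listed values. The sketch below expands the same small grid with `itertools.product` to show which candidates are evaluated. It only illustrates the enumeration and is not how the library is implemented internally.

```python
from itertools import product

param_grid = {
    "n_epochs": [10, 20],
    "lr_all": [0.002, 0.005],
    "reg_all": [0.02],
}

# Cartesian product of all value lists, one dict per candidate setting.
names = list(param_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*(param_grid[name] for name in names))
]

for combo in combinations:
    print(combo)
print(len(combinations), "combinations")
```

With 2 x 2 x 1 = 4 candidates and 5-fold cross-validation, the search fits 20 models, which is why the larger commented-out grid takes much longer.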

from datetime import datetime
print(training_parameters)

print("\n\n\t\t STARTING\n\n")
start = datetime.now()

print("> OK")

print("> Creating trainset...")
trainset = data.build_full_trainset()
print("> OK")

startTraining = datetime.now()
print("> Training...")

algo = SVD(n_epochs = training_parameters['n_epochs'], lr_all = training_parameters['lr_all'], reg_all = training_parameters['reg_all'])

algo.fit(trainset)

endTraining = datetime.now()
print("> OK \t\t It Took: ", (endTraining-startTraining).seconds, "seconds")

end = datetime.now()
print (">> DONE \t\t It Took", (end-start).seconds, "seconds" )

{'n_epochs': 20, 'lr_all': 0.005, 'reg_all': 0.02}

STARTING

> OK
> Creating trainset...
> OK
> Training...

> OK 		 It Took:  0 seconds
>> DONE 		 It Took 0 seconds

# ## SAVING TRAINED MODEL
# from surprise import dump
# import os
# model_filename = "./model.pickle"
# print(">> Starting dump")
# # Dump algorithm and reload it.
# file_name = os.path.expanduser(model_filename)
# dump.dump(file_name, algo=algo)
# print(">> Dump done")
# print(model_filename)

# ## LOAD SAVED MODEL
# from surprise import dump
# import os
# file_name = os.path.expanduser(model_filename)
# _, algo = dump.load(file_name)  # dump.load returns (predictions, algo)

from pprint import pprint as pp

def itemRating(user, item):
    # Raw ids must have the same type as in the training data. The ratings were
    # loaded from a DataFrame with integer ids, so string ids such as "610" are
    # unknown to the model and SVD falls back to the global mean rating.
    # Pass user=610, item=10 for a personalised estimate.
    prediction = algo.predict(user, item, verbose=True)
    ret = {
        'user': user,
        'item': item,
        'rating': prediction.est,
        'details': prediction.details,
        'uid': prediction.uid,
        'iid': prediction.iid,
        'true': prediction.r_ui
    }
    pp(ret)
    print('\n\n')
    return ret

print(itemRating(user = "610", item = "10"))

user: 610        item: 10         r_ui = None   est = 3.54   {'was_impossible': False}
{'details': {'was_impossible': False},
'iid': '10',
'item': '10',
'rating': 3.543813091304151,
'true': None,
'uid': '610',
'user': '610'}

{'user': '610', 'item': '10', 'rating': 3.543813091304151, 'details': {'was_impossible': False}, 'uid': '610', 'iid': '10', 'true': None}