Surprise (Simple Python RecommendatIon System Engine) is a recommendation system library in the scikit family. It is simple and easy to use, while supporting a variety of recommendation algorithms (baseline algorithms, collaborative filtering, matrix factorization, etc.).
Surprise was designed with the following goals in mind:
- Give users perfect control over their experiments. To this end, a strong emphasis is placed on documentation, which tries to be as clear and precise as possible by pointing out every detail of the algorithms.
- Alleviate the pain of dataset handling. Users can use both built-in datasets (e.g., Movielens, Jester) and their own custom datasets.
- Provide a variety of ready-to-use prediction algorithms, such as baseline algorithms, neighborhood methods, and matrix factorization (SVD, PMF, SVD++, NMF). Various similarity measures (cosine, MSD, Pearson, ...) are also built in.
- Make it easy to implement new algorithm ideas.
- Provide tools to evaluate, analyze, and compare algorithm performance. Cross-validation procedures can be run very easily using powerful CV iterators (inspired by scikit-learn's excellent tools), along with exhaustive search over a set of parameters.
1. Installing Surprise
```
pip install numpy
pip install scikit-surprise
```
Make sure the numpy module is installed before installing scikit-surprise.
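A quick way to confirm the install succeeded is to import the package and print its version:
```python
# Import the library and print its version to verify the install.
import surprise
print(surprise.__version__)
```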
2. Basic Algorithms
Algorithm class | Description |
---|---|
random_pred.NormalPredictor | Predicts a random rating based on the distribution of the training set, which is assumed to be normal. |
baseline_only.BaselineOnly | Predicts the baseline estimate for a given user and item. |
knns.KNNBasic | A basic collaborative filtering algorithm. |
knns.KNNWithMeans | Collaborative filtering that takes into account the mean rating of each user. |
knns.KNNBaseline | Collaborative filtering that takes into account a baseline rating. |
matrix_factorization.SVD | The SVD matrix factorization algorithm. |
matrix_factorization.SVDpp | The SVD++ algorithm, an extension of SVD that also takes implicit feedback into account. |
matrix_factorization.NMF | Collaborative filtering based on non-negative matrix factorization. |
slope_one.SlopeOne | The SlopeOne algorithm, a simple yet accurate collaborative filtering algorithm. |
co_clustering.CoClustering | A collaborative filtering algorithm based on co-clustering. |
Neighborhood-based methods (collaborative filtering) can be configured with different similarity measures.
Similarity measure | Description |
---|---|
cosine | Compute the cosine similarity between all pairs of users (or items). |
msd | Compute the Mean Squared Difference similarity between all pairs of users (or items). |
pearson | Compute the Pearson correlation coefficient between all pairs of users (or items). |
pearson_baseline | Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items), using baselines for centering instead of means. |
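A similarity measure is selected through the sim_options dictionary when constructing a neighborhood-based algorithm; for example (this mirrors the KNNBaseline configuration used in the movielens example in section 6):
```python
from surprise import KNNBaseline

# Compute pearson_baseline similarity between items
# (user_based=False) rather than between users.
sim_options = {'name': 'pearson_baseline', 'user_based': False}
algo = KNNBaseline(sim_options=sim_options)
```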
Different evaluation criteria are supported:
Evaluation criterion | Description |
---|---|
rmse | Compute RMSE (Root Mean Squared Error). |
mae | Compute MAE (Mean Absolute Error). |
fcp | Compute FCP (Fraction of Concordant Pairs). |
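Besides passing measures to cross_validate, these metrics can be computed directly on a set of predictions through Surprise's accuracy module; a minimal sketch:
```python
from surprise import SVD, Dataset, accuracy
from surprise.model_selection import train_test_split

# Split the built-in movielens data into train and test sets.
data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=0.25)

algo = SVD()
algo.fit(trainset)
predictions = algo.test(testset)

# Each accuracy function takes a list of predictions.
accuracy.rmse(predictions)
accuracy.mae(predictions)
accuracy.fcp(predictions)
```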
3. Using Surprise
(1) Loading the built-in dataset
```python
# -*- coding: utf-8 -*-
# Any of the recommendation algorithms listed above can be used here
from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

# Load the built-in movielens dataset by default
data = Dataset.load_builtin('ml-100k')
# K-fold splitting via data.split(n_folds=3) is now deprecated;
# pass cv=3 to cross_validate instead
# Try SVD matrix factorization
algo = SVD()
# Evaluate on the dataset with 3-fold cross-validation and print the results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
```
Running it prints the RMSE and MAE for each of the three folds.
(2) Load your own dataset
```python
from surprise import SVD
from surprise import Dataset, Reader
from surprise.model_selection import cross_validate
import os

# Specify the path of the file
file_path = os.path.expanduser('data.csv')
# Tell the reader what the format of the text file is
reader = Reader(line_format='user item rating', sep=',')
# Load the data
data = Dataset.load_from_file(file_path, reader=reader)

algo = SVD()
# Evaluate on the dataset with 3-fold cross-validation and print the results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
```
Points to note:
1. Surprise cannot handle Chinese text; if the data contains Chinese, it must first be converted to ID numbers (a simple conversion method is listed below).
2. The file must not contain a header row; remove the header and any Chinese metadata columns.
3. The Reader must be adapted to your file: line_format lists the columns of the data, and sep is the separator (for comma-separated files it is ',').
A simple data conversion method:
```python
# -*- coding: utf-8 -*-
# Build an item id column
import pandas as pd

df = pd.read_csv('train_score.csv', encoding="gbk")
# Read the data in the second column
item_name = df.iloc[:, 1]
item = {}
item_id = []
num = 0
# Map each distinct item to an id number
for i in item_name:
    if i in item:
        item_id.append(item[i])
    else:
        item[i] = num
        item_id.append(num)
        num += 1
print(item_id)
df['itemId'] = item_id
df.to_csv("data.csv", encoding="gbk", index=False)
```
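Since the data ends up in a pandas DataFrame anyway, an alternative worth knowing is Dataset.load_from_df, which skips the intermediate CSV file entirely. A minimal sketch, assuming the DataFrame df has columns named userId and rating alongside the itemId column built above (the first two names are hypothetical), and that ratings fall in a 1-5 range:
```python
from surprise import Dataset, Reader

# rating_scale must match the actual range of the ratings (assumed 1-5 here).
reader = Reader(rating_scale=(1, 5))
# The columns must be passed in the order: user id, item id, rating.
data = Dataset.load_from_df(df[['userId', 'itemId', 'rating']], reader)
```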
4. Tuning algorithm parameters
The algorithms implemented here are mostly trained with SGD and the like, so they come with hyperparameters that affect the final result. As with sklearn, we can use grid search to select the optimal parameters. A simple example using the older GridSearch interface (since deprecated in favor of GridSearchCV, shown afterwards):
```python
from surprise import SVD, Dataset
from surprise import GridSearch

# Define the grid of parameters to optimize over
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
# Grid search with cross-validation
grid_search = GridSearch(SVD, param_grid, measures=['RMSE', 'FCP'])
# Find the best parameters on the dataset
data = Dataset.load_builtin('ml-100k')
data.split(n_folds=3)
grid_search.evaluate(data)
# Print the best RMSE score
print(grid_search.best_score['RMSE'])
# >>> 0.96117566386
# Print the parameters corresponding to the best RMSE score
print(grid_search.best_params['RMSE'])
# >>> {'reg_all': 0.4, 'lr_all': 0.005, 'n_epochs': 10}
# Best FCP score
print(grid_search.best_score['FCP'])
# >>> 0.702279736531
# Parameters corresponding to the best FCP score
print(grid_search.best_params['FCP'])
# >>> {'reg_all': 0.6, 'lr_all': 0.005, 'n_epochs': 10}
```
The same search with the GridSearchCV method:
```python
from surprise import SVD, Dataset
from surprise.model_selection import GridSearchCV

# Define the grid of parameters to optimize over
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
# Grid search with 3-fold cross-validation
grid_search = GridSearchCV(SVD, param_grid, measures=['RMSE', 'FCP'], cv=3)
# Find the best parameters on the dataset
data = Dataset.load_builtin('ml-100k')
grid_search.fit(data)
# Print the best scores (one per measure)
print(grid_search.best_score)
```
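GridSearchCV also records the full cross-validation results in its cv_results attribute, which is convenient to inspect as a pandas DataFrame:
```python
import pandas as pd

# One row per parameter combination, with the scores for each measure.
results_df = pd.DataFrame.from_dict(grid_search.cv_results)
print(results_df.head())
```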
1. estimator: the estimator to use. Pass in parameters other than the ones being searched over; each estimator needs either a scoring parameter or a score method. For example: estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10).
2. param_grid: the parameter values to optimize over, as a dictionary or a list of dictionaries. For example: param_grid=param_test1 with param_test1 = {'n_estimators': range(10, 71, 10)}.
3. scoring=None: the model evaluation criterion. It can be a string (a scorer name, e.g. scoring='roc_auc') or a callable; callables must have a signature like scorer(estimator, X, y). The appropriate criterion depends on the chosen model. If None, the estimator's own score method or error estimation function is used.
4. n_jobs=1: the number of parallel jobs (int). -1 means use as many jobs as there are CPUs; the default is 1.
5. cv=None: the cross-validation strategy. Defaults to None, meaning 3-fold cross-validation. It can also be an integer specifying the number of folds, or a generator yielding train/test splits.
6. verbose=0: log verbosity (int). 0 prints nothing during training, 1 prints occasional output, and >1 prints output for every sub-model.
7. pre_dispatch='2*n_jobs': the total number of parallel tasks to dispatch. When n_jobs is greater than 1, the data is replicated for each run, which may lead to OOM. Setting pre_dispatch caps the number of jobs dispatched up front, so the data is copied at most pre_dispatch times.
8. return_train_score='warn': if False, the cv_results_ attribute will not include training scores.
9. refit: defaults to True. After the search finishes, the estimator is refit on all available training and development data using the best parameters found during cross-validation, and this refit model is the one used for performance evaluation and prediction.
10. iid: defaults to True. When True, samples are assumed to be identically distributed across folds, and the error is estimated as the total over all samples rather than the average per fold.
Commonly used methods and attributes:
- grid.fit(): run the grid search
- grid_scores_: the evaluation results for the different parameter settings
- best_params_: the parameter combination that gave the best results
- best_score_: the best score observed during the search
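Back in Surprise, the analogous workflow after a grid search is to pull out the best-performing algorithm and refit it on the full trainset. A minimal sketch continuing from the GridSearchCV example above (best_score, best_params, and best_estimator are dicts keyed by the lower-cased measure name):
```python
# Retrieve the algorithm instance that achieved the best RMSE...
algo = grid_search.best_estimator['rmse']
# ...and refit it on the whole dataset before making predictions.
algo.fit(data.build_full_trainset())
```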
5. Modeling and comparing different recommendation algorithms
```python
from surprise import Dataset
from surprise.model_selection import cross_validate

data = Dataset.load_builtin('ml-100k')

### Using NormalPredictor
from surprise import NormalPredictor
algo = NormalPredictor()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

### Using BaselineOnly
from surprise import BaselineOnly
algo = BaselineOnly()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

### Using basic collaborative filtering
from surprise import KNNBasic
algo = KNNBasic()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

### Using mean-centered collaborative filtering
from surprise import KNNWithMeans
algo = KNNWithMeans()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

### Using collaborative filtering with baselines
from surprise import KNNBaseline
algo = KNNBaseline()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

### Using SVD
from surprise import SVD
algo = SVD()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

### Using SVD++
from surprise import SVDpp
algo = SVDpp()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)

### Using NMF
from surprise import NMF
algo = NMF()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=3, verbose=True)
```
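Since each block above differs only in the algorithm class, the same comparison reads more compactly as a loop; a sketch of the identical experiment:
```python
from surprise import (BaselineOnly, Dataset, KNNBaseline, KNNBasic,
                      KNNWithMeans, NMF, NormalPredictor, SVD, SVDpp)
from surprise.model_selection import cross_validate

data = Dataset.load_builtin('ml-100k')

# Run the same 3-fold cross-validation for every algorithm.
for algo_class in (NormalPredictor, BaselineOnly, KNNBasic, KNNWithMeans,
                   KNNBaseline, SVD, SVDpp, NMF):
    print(algo_class.__name__)
    cross_validate(algo_class(), data, measures=['RMSE', 'MAE'],
                   cv=3, verbose=True)
```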
6. A movielens recommendation example
```python
# -*- coding: utf-8 -*-
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)
import os
import io
from surprise import KNNBaseline
from surprise import Dataset
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
                    datefmt='%a, %d %b %Y %H:%M:%S')


# Step 1: train the recommendation model
def getSimModel():
    # Load the built-in movielens dataset
    data = Dataset.load_builtin('ml-100k')
    trainset = data.build_full_trainset()
    # Use the pearson_baseline similarity measure; user_based=False means
    # similarity is computed between items rather than between users
    sim_options = {'name': 'pearson_baseline', 'user_based': False}
    # Use the KNNBaseline algorithm
    algo = KNNBaseline(sim_options=sim_options)
    # Train the model
    algo.fit(trainset)
    return algo


# Step 2: get the id-to-name mappings
def read_item_names():
    """Get the mappings from movie name to movie id and from movie id to movie name."""
    file_name = (os.path.expanduser('~') +
                 '/.surprise_data/ml-100k/ml-100k/u.item')
    rid_to_name = {}
    name_to_rid = {}
    with io.open(file_name, 'r', encoding='ISO-8859-1') as f:
        for line in f:
            line = line.split('|')
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]
    return rid_to_name, name_to_rid


# Step 3: recommend related movies based on the trained model
def showSimilarMovies(algo, rid_to_name, name_to_rid):
    # Get the raw_id of the movie Toy Story (1995)
    toy_story_raw_id = name_to_rid['Toy Story (1995)']
    logging.debug('raw_id=' + toy_story_raw_id)
    # Convert the movie's raw_id to the model's internal id
    toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)
    logging.debug('inner_id=' + str(toy_story_inner_id))
    # Get the 10 nearest neighbors from the model
    toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, 10)
    logging.debug('neighbors_ids=' + str(toy_story_neighbors))
    # Convert the model's internal ids back to actual movie ids
    neighbors_raw_ids = [algo.trainset.to_raw_iid(inner_id)
                         for inner_id in toy_story_neighbors]
    # Turn the list of movie ids into a list of recommended movie names
    neighbors_movies = [rid_to_name[raw_id] for raw_id in neighbors_raw_ids]
    print('The 10 nearest neighbors of Toy Story are:')
    for movie in neighbors_movies:
        print(movie)


if __name__ == '__main__':
    # Get the id-to-name mappings
    rid_to_name, name_to_rid = read_item_names()
    # Train the recommendation model
    algo = getSimModel()
    # Show related movies
    showSimilarMovies(algo, rid_to_name, name_to_rid)
```
Running the script prints the 10 nearest neighbors of Toy Story (1995).
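Beyond neighbor lookups, the trained model can also estimate the rating a given user would assign a given item via predict(). A minimal sketch (user id '196' and item id '302' are raw ids from ml-100k, picked purely for illustration):
```python
# Predict how user '196' would rate item '302'; raw ids in ml-100k are strings.
pred = algo.predict('196', '302')
print(pred.est)
```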