Collaborative filtering recommendation algorithm

Posted by cwncool on Fri, 28 Jan 2022 12:02:06 +0100

1, Recommendation model construction process

Data -> Features -> ML algorithm -> Prediction output

1.1 Data cleaning / data processing

Data sources

  • Explicit data
    • Ratings
    • Comments / reviews
  • Implicit data
    • Order history
    • Cart events (adding items to the shopping cart)
    • Page views
    • Click-throughs
    • Search logs

Data volume: whether the available data can meet the requirements

1.2 Feature engineering

Selecting features from the data

  • A given product may be purchased by users with similar tastes or needs
  • Use user behavior data to describe products

Representing features with data

  • Combine all user behaviors to form a user-item matrix

1.3 Select an appropriate algorithm

1.4 Generate recommendation results

2, Collaborative Filtering recommendation algorithm

Algorithmic idea: birds of a feather flock together

The basic collaborative filtering recommendation algorithm is based on the following assumptions:

  • "You are likely to like things that people like similar to you like": user based collaborative filtering recommendation (CF)
  • "You are likely to like something similar to what you like": item based collaborative filtering recommendation (CF)

There are several steps to implement collaborative filtering recommendation:

  1. Find the most similar people or items: the TOP-N most similar people or items

    By computing pairwise similarities and sorting them, we can find the TOP-N most similar people or items

  2. Generate recommendation results based on the similar people or items

    Use the TOP-N results to generate the initial recommendations, then filter out items the user already has a record with or has explicitly expressed no interest in

The following is a simple example. The data set is a table of users' purchase records of items: a tick indicates that the user has purchased the item

For the similarity calculation, let's start with a simple idea: suppose there are two students, X and Y. X likes [football, basketball, table tennis] and Y likes [tennis, football, basketball, badminton]. They have two hobbies in common, so their similarity can be expressed as 2/3 * 2/4 = 1/3 ≈ 0.33, as in the sketch below.
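
As a quick check of the arithmetic, here is a minimal sketch of this naive overlap-based similarity (the hobby sets are the hypothetical ones from the example):

# Naive overlap similarity: (|common| / |X|) * (|common| / |Y|)
x = {"football", "basketball", "table tennis"}
y = {"tennis", "football", "basketball", "badminton"}
common = x & y                                    # {"football", "basketball"}
similarity = len(common) / len(x) * len(common) / len(y)
print(round(similarity, 2))                       # 0.33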

2.1 User-Based CF

2.2 Item-Based CF

Through the previous two demos, you should now have a clear understanding of the design and implementation of the collaborative filtering recommendation algorithm.

3, Similarity calculation

3.1 Similarity calculation methods

Data types

  • Real-valued data (item ratings)
  • Boolean data (whether or not a user behavior, such as favoriting, occurred)

Euclidean distance

Euclidean distance is the distance measure of Euclidean space. Two objects are represented as two points in the same space, say P and Q, each with n coordinates; the Euclidean distance measures how far apart the two points are:

\[d(P,Q)=\sqrt{\sum_{i=1}^{n}(p_i-q_i)^2} \]

Euclidean distance is not suitable for Boolean vectors.

The value of the Euclidean distance is a non-negative number with no upper bound, while the result of a similarity calculation is generally expected to fall in [-1, 1] or [0, 1], so a conversion is applied. A commonly used conversion formula is:

\[sim(P,Q)=\cfrac{1}{1+d(P,Q)} \]
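
A minimal sketch of this conversion, assuming numpy and two made-up points:

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 3.0])
d = np.linalg.norm(p - q)            # Euclidean distance: sqrt(9 + 16 + 0) = 5.0
sim = 1 / (1 + d)                    # map the [0, +inf) distance into (0, 1]
print(round(d, 2), round(sim, 2))    # 5.0 0.17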

Jaccard similarity & cosine similarity & Pearson correlation coefficient

  • Cosine similarity
    • Measures the angle between two vectors, using the cosine of the angle to quantify how similar they are
    • When the angle between two vectors is 0°, the cosine is 1; at 90° the cosine is 0; at 180° the cosine is -1
    • Cosine similarity is commonly used to measure text similarity, user similarity, and item similarity
    • Cosine similarity is independent of vector length, since the computation normalizes by the vector norms: as long as two vectors point in the same direction they are considered "similar", regardless of magnitude
  • Pearson correlation coefficient
    • Essentially a kind of cosine similarity: the vectors are centered first (each vector's mean is subtracted from its components), and then the cosine similarity is computed
    • Pearson similarity lies between -1 and 1; -1 indicates negative correlation and 1 indicates positive correlation
    • Measures whether two variables increase and decrease together
    • Because the Pearson correlation coefficient measures whether the trends of two variables are consistent, it is not suitable for computing the correlation between Boolean vectors
  • Jaccard similarity
    • The ratio of the size of the intersection of two sets to the size of their union; well suited to Boolean vectors
    • The numerator is the dot product of the two Boolean vectors, which equals the number of elements in the intersection
    • The denominator is the number of elements in the union, i.e., the count of positions where either vector is 1
  • Cosine similarity is suitable for user rating data (real values); Jaccard similarity is suitable for implicit feedback data (0/1 Boolean values such as favorites, clicks, and add-to-cart events)

  • Cosine similarity

\[cos(\theta)=\cfrac{A\cdot B}{\|A\|\|B\|}=\cfrac{\sum_{i=1}^{n}A_iB_i}{\sqrt{\sum_{i=1}^{n}A_i^2}\sqrt{\sum_{i=1}^{n}B_i^2}} \]

  • Pearson correlation coefficient

\[\rho_{X,Y}=\cfrac{cov(X,Y)}{\sigma_X\sigma_Y}=\cfrac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}} \]
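
All three measures can be computed directly with scipy; here is a small sketch with made-up vectors (note that scipy's cosine and jaccard functions return distances, so we subtract them from 1):

import numpy as np
from scipy.spatial.distance import cosine, jaccard
from scipy.stats import pearsonr

a = np.array([5, 3, 4, 4, 2])    # real-valued rating vectors
b = np.array([3, 1, 2, 3, 3])
print(1 - cosine(a, b))          # cosine similarity, ~0.93
print(pearsonr(a, b)[0])         # Pearson correlation coefficient, ~0.20

u = np.array([1, 0, 1, 1, 0])    # Boolean behavior vectors
v = np.array([1, 0, 0, 1, 1])
print(1 - jaccard(u, v))         # Jaccard similarity, 0.5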

3.2 User similarity calculation

  • Compute the similarity between user 1 and every other user

  • Sort by similarity and take the k nearest neighbors, e.g., k = 4:

  • Take out the purchase lists of those neighboring users

  • Remove the items user 1 has already purchased

  • Sort the remaining items by score (see the sketch below)
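
A minimal sketch of these steps, reusing the toy purchase data and User1's neighbor similarities from section 4 below; candidate items are scored by the summed similarity of the neighbors who bought them:

neighbors = {"User2": 0.5, "User3": 0.67, "User4": 0.2, "User5": 0.4}   # k = 4
bought = {
    "User1": {"Item A", "Item C", "Item D"},
    "User2": {"Item A", "Item D", "Item E"},
    "User3": {"Item A", "Item C"},
    "User4": {"Item B", "Item D", "Item E"},
    "User5": {"Item A", "Item B", "Item C", "Item E"},
}
scores = {}
for user, sim in neighbors.items():
    for item in bought[user] - bought["User1"]:    # drop already-purchased items
        scores[item] = scores.get(item, 0) + sim   # similarity-weighted vote
ranked = sorted(scores.items(), key=lambda kv: -kv[1])
print([(item, round(s, 2)) for item, s in ranked])   # [('Item E', 1.1), ('Item B', 0.6)]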

3.3 Item similarity calculation

  • Cosine similarity is insensitive to absolute rating values
    • User A rated two movies 1 and 2; user B rated the same two movies 4 and 5. By cosine similarity, the similarity between the two users is 0.98, even though their tastes clearly differ
    • The adjusted cosine similarity can be used instead: first compute the mean of each dimension of the vectors, then subtract that mean from each vector in each dimension before computing the cosine similarity. With the adjusted cosine similarity the computed similarity is -0.1 (see the sketch below)
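
A minimal sketch of both computations on this example. Note that with only two users and two movies the per-dimension centering is degenerate and yields exactly -1; values such as the -0.1 quoted above presumably come from centering over a fuller rating matrix:

import numpy as np

a = np.array([1.0, 2.0])    # User A's ratings of the two movies
b = np.array([4.0, 5.0])    # User B's ratings of the two movies

def cos_sim(x, y):
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(round(cos_sim(a, b), 2))    # 0.98: plain cosine says "very similar"

# Adjusted cosine: subtract each dimension's mean before computing cosine.
# On this two-user toy example the centered vectors point in exactly opposite
# directions, so the result is -1; the sign flip is what matters.
mean = (a + b) / 2
print(round(cos_sim(a - mean, b - mean), 2))    # -1.0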

  • Item similarity calculation example

  • Find the items most similar to item 1

  • Select the nearest items

4, Code implementation of collaborative filtering recommendation algorithm

4.1 Building the data set

During calculation, we usually need to process or encode the data to make it easier to work with. Here is a relatively simple case: we use 1 and 0 to indicate whether or not a user has purchased an item. Our data set is then as follows:

import pandas as pd
users = ["User1", "User2", "User3", "User4", "User5"]
items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
# User purchase record data set
datasets = [
    [1,0,1,1,0],
    [1,0,0,1,1],
    [1,0,1,0,0],
    [0,1,0,1,1],
    [1,1,1,0,1],
]
df = pd.DataFrame(datasets,columns=items,index=users)
print(df)

Operation results:

  Item A  Item B  Item C  Item D  Item E
User1       1       0       1       1       0
User2       1       0       0       1       1
User3       1       0       1       0       0
User4       0       1       0       1       1
User5       1       1       1       0       1

4.2 Similarity calculation

With the data set in place, we can calculate similarities. There are many similarity measures, such as cosine similarity, the Pearson correlation coefficient, and Jaccard similarity. Here we choose the Jaccard similarity coefficient, which lies in [0, 1]

# Directly calculate the Jaccard similarity coefficient of one pair of items
from sklearn.metrics import jaccard_score
# Calculate the similarity between Item A and Item B
print(jaccard_score(df["Item A"], df["Item B"]))    # 0.2
# Calculate the Jaccard similarity coefficient over the whole data set
# pairwise_distances computes the distance between every pair of rows of its input
from sklearn.metrics.pairwise import pairwise_distances
# Calculate the similarity between users; the Jaccard metric expects Boolean data
user_similar = 1 - pairwise_distances(df.values.astype(bool), metric="jaccard")
user_similar = pd.DataFrame(user_similar, columns=users, index=users)
print("Pairwise similarity between users:")
print(user_similar)

Operation results:

Pairwise similarity between users:
          User1  User2     User3  User4  User5
User1  1.000000   0.50  0.666667    0.2    0.4
User2  0.500000   1.00  0.250000    0.5    0.4
User3  0.666667   0.25  1.000000    0.0    0.5
User4  0.200000   0.50  0.000000    1.0    0.4
User5  0.400000   0.40  0.500000    0.4    1.0

# Calculate the similarity between items (columns): transpose the matrix first
item_similar = 1 - pairwise_distances(df.values.T.astype(bool), metric="jaccard")
item_similar = pd.DataFrame(item_similar, columns=items, index=items)
print("Pairwise similarity between items:")
print(item_similar)

Operation results:

Pairwise similarity between items:
        Item A    Item B  Item C  Item D    Item E
Item A    1.00  0.200000    0.75    0.40  0.400000
Item B    0.20  1.000000    0.25    0.25  0.666667
Item C    0.75  0.250000    1.00    0.20  0.200000
Item D    0.40  0.250000    0.20    1.00  0.500000
Item E    0.40  0.666667    0.20    0.50  1.000000

With the pairwise similarities computed, we can then take the TOP-N most similar users or items and generate recommendations from them

4.3 User-Based CF

import pandas as pd
import numpy as np
from pprint import pprint

topN_users = {}
# Traverse each user (each row of the similarity matrix)
for i in user_similar.index:
    # Take this user's similarity row, drop the user itself, then sort descending
    _df = user_similar.loc[i].drop([i])
    _df_sorted = _df.sort_values(ascending=False)

    top2 = list(_df_sorted.index[:2])
    topN_users[i] = top2

print("Top2 Similar users:")
pprint(topN_users)

rs_results = {}
# Build recommendation results
for user, sim_users in topN_users.items():
    rs_result = set()    # Store recommendation results
    for sim_user in sim_users:
        # Build initial recommendation results
        rs_result = rs_result.union(set(df.loc[sim_user].replace(0,np.nan).dropna().index))
    # Filter out purchased items
    rs_result -= set(df.loc[user].replace(0,np.nan).dropna().index)
    rs_results[user] = rs_result
print("Final recommendation result:")
pprint(rs_results)

Operation results:

Top2 Similar users:
{'User1': ['User3', 'User2'],
 'User2': ['User1', 'User4'],
 'User3': ['User1', 'User5'],
 'User4': ['User2', 'User5'],
 'User5': ['User3', 'User1']}
Final recommendation result:
{'User1': {'Item E'},
 'User2': {'Item C', 'Item B'},
 'User3': {'Item E', 'Item D', 'Item B'},
 'User4': {'Item C', 'Item A'},
 'User5': {'Item D'}}

4.4 Item-Based CF

import pandas as pd
import numpy as np
from pprint import pprint

topN_items = {}
# Traverse each item (each row of the similarity matrix)
for i in item_similar.index:
    # Take this item's similarity row, drop the item itself, then sort descending
    _df = item_similar.loc[i].drop([i])
    _df_sorted = _df.sort_values(ascending=False)

    top2 = list(_df_sorted.index[:2])
    topN_items[i] = top2

print("Top2 Similar items:")
pprint(topN_items)

rs_results = {}
# Build recommendation results
for user in df.index:    # Traverse all users
    rs_result = set()
    for item in df.loc[user].replace(0,np.nan).dropna().index:   # Take out the current shopping list of each user
        # Find the most similar TOP-N items according to each item and build the initial recommendation result
        rs_result = rs_result.union(topN_items[item])
    # Filter out items purchased by users
    rs_result -= set(df.loc[user].replace(0,np.nan).dropna().index)
    # Add to results
    rs_results[user] = rs_result

print("Final recommendation result:")
pprint(rs_results)

Operation results:

Top2 Similar items:
{'Item A': ['Item C', 'Item D'],
 'Item B': ['Item E', 'Item C'],
 'Item C': ['Item A', 'Item B'],
 'Item D': ['Item E', 'Item A'],
 'Item E': ['Item B', 'Item D']}
Final recommendation result:
{'User1': {'Item E', 'Item B'},
 'User2': {'Item C', 'Item B'},
 'User3': {'Item D', 'Item B'},
 'User4': {'Item A', 'Item C'},
 'User5': {'Item D'}}

4.5 Others

On the data sets used by collaborative filtering recommendation algorithms

In the previous demos we only used users' purchase records of items. We could likewise use browsing or click records, listening records, and so on. In effect, what we predicted was whether a user would be interested in an item; we could not predict the degree of preference very well.

Therefore, in practice, collaborative filtering recommendation algorithms more often use users' rating data on items for prediction. From a rating data set we can predict a user's ratings of the items they have not yet rated. The implementation principle and idea are the same as before, except that the data set used is user-item rating data.

About the user-item rating matrix

The user-item rating matrix is handled differently depending on how sparse it is

  • Dense rating matrix

  • Sparse rating matrix

Here we first cover how to handle a dense rating matrix. Handling a sparse matrix is more complex and will be introduced later.

5, Predicting user ratings with the collaborative filtering recommendation algorithm

5.1 Building the data set

Objective: to predict the score of user 1 on item E

Note that when constructing the rating data here, missing entries must be kept as None. If they were set to 0, they would be treated as an actual rating of 0

users = ["User1", "User2", "User3", "User4", "User5"]
items = ["Item A", "Item B", "Item C", "Item D", "Item E"]
# User rating data set
datasets = [
    [5,3,4,4,None],
    [3,1,2,3,3],
    [4,3,4,3,5],
    [3,3,1,5,4],
    [1,5,5,2,1],
]

5.2 Calculating similarity

For rating data we use the Pearson correlation coefficient, which lies in [-1, 1]: -1 indicates strong negative correlation and +1 indicates strong positive correlation

The corr method in pandas can be used directly to compute the Pearson correlation coefficient

df = pd.DataFrame(datasets,columns=items,index=users)

print("Pairwise similarity between users:")
# Compute the Pearson correlation coefficient directly
# corr computes correlations between columns, so for user-user similarity the matrix is transposed first
user_similar = df.T.corr()
print(user_similar.round(4))

print("Pairwise similarity between items:")
item_similar = df.corr()
print(item_similar.round(4))

Operation results:

Pairwise similarity between users:
        User1   User2   User3   User4   User5
User1  1.0000  0.8528  0.7071  0.0000 -0.7921
User2  0.8528  1.0000  0.4677  0.4900 -0.9001
User3  0.7071  0.4677  1.0000 -0.1612 -0.4666
User4  0.0000  0.4900 -0.1612  1.0000 -0.6415
User5 -0.7921 -0.9001 -0.4666 -0.6415  1.0000
Pairwise similarity between items:
        Item A  Item B  Item C  Item D  Item E
Item A  1.0000 -0.4767 -0.1231  0.5322  0.9695
Item B -0.4767  1.0000  0.6455 -0.3101 -0.4781
Item C -0.1231  0.6455  1.0000 -0.7206 -0.4276
Item D  0.5322 -0.3101 -0.7206  1.0000  0.5817
Item E  0.9695 -0.4781 -0.4276  0.5817  1.0000

It can be seen that the users most similar to user 1 are user 2 and user 3, and the items most similar to item A are item E and item D.

Note: when predicting ratings, we usually predict from positively correlated users or items; if there is no positive correlation, we cannot make a prediction. This is especially common with sparse rating matrices, where positive correlation coefficients are hard to obtain.

5.3 Rating prediction

User-Based CF rating prediction: use the similarity between users to predict

There are many schemes for rating prediction. A commonly used and effective one predicts a user's rating of an item as the similarity-weighted average of the neighboring users' ratings of that item:

\[pred(u,i)=\hat{r}_{ui}=\cfrac{\sum_{v\in U}sim(u,v)*r_{vi}}{\sum_{v\in U}|sim(u,v)|} \]

If we want to predict user 1's rating of item E, we can predict it from user 2 and user 3, the users nearest to user 1. The calculation is as follows:

\[pred(u_1, i_5) =\cfrac{0.85*3+0.71*5}{0.85+0.71} = 3.91 \]

The similarity between user 1 and user 2 is 0.85, the similarity between user 1 and user 3 is 0.71, the score of user 2 on item E is 3, and the score of user 3 on item E is 5

The final predicted rating of user 1 on item E is therefore 3.91.
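
The formula is easy to reproduce in code. Below is a minimal sketch reusing df and user_similar from section 5.2; the helper name and the choice k = 2 are mine:

def predict_user_based(user, item, k=2):
    rated = df[item].dropna().index                        # users who rated the item
    sims = user_similar.loc[user, rated].drop(user, errors="ignore")
    top = sims.sort_values(ascending=False)[:k]            # k nearest neighbors
    return (top * df.loc[top.index, item]).sum() / top.abs().sum()

print(round(predict_user_based("User1", "Item E"), 2))     # 3.91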

Item-Based CF rating prediction: use the similarity between items to predict

The calculation is analogous to the user-based case: the predicted rating is the similarity-weighted average of the user's own ratings of the items most similar to the target item:

\[pred(u,i)=\hat{r}_{ui}=\cfrac{\sum_{j\in I_{rated}}sim(i,j)*r_{uj}}{\sum_{j\in I_{rated}}sim(i,j)} \]

If we want to predict user 1's rating of item E, we can predict it from item A and item D, the items nearest to item E. The calculation is as follows:

\[pred(u_1, i_5) = \cfrac {0.97*5+0.58*4}{0.97+0.58} = 4.63 \]
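
And the item-based counterpart, again as a sketch against df and item_similar from section 5.2 (the absolute value in the denominator makes no difference here because the selected neighbors are positively correlated):

def predict_item_based(user, item, k=2):
    rated = df.loc[user].dropna().index                    # items the user has rated
    sims = item_similar.loc[item, rated].drop(item, errors="ignore")
    top = sims.sort_values(ascending=False)[:k]            # k most similar items
    return (top * df.loc[user, top.index]).sum() / top.abs().sum()

print(round(predict_item_based("User1", "Item E"), 2))     # 4.63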

Comparing the two, the User-Based CF prediction and the Item-Based CF prediction differ. Strictly speaking, they are two different recommendation algorithms; each outperforms the other in certain domains and scenarios, and which is better must be determined by proper evaluation. Therefore, when implementing a recommendation system, both algorithms are often built, and the recommendation results are then evaluated and analyzed to select the better scheme.

6, Model-based approach

6.1 Ideas

  • Use machine learning algorithms to find patterns in the data and model the interactions between users and items

  • Model-based collaborative filtering methods build collaborative filtering into more advanced algorithms

  • Problems with the nearest-neighbor model

    • Items are correlated with one another, so the amount of information does not grow linearly with the vector dimension
    • The matrix elements are sparse and the calculation results are unstable: adding or removing a single vector dimension can produce very different nearest-neighbor results

6.2 Algorithm classification

Graph-based model

Neighborhood-based models can be seen as a simple form of graph-based model

Principle

  • Represent the user's behavior data as a bipartite graph
  • Make recommendations for users based on the bipartite graph
  • Evaluate the correlation between two vertices by the number of paths between them, the lengths of those paths, and the vertices the paths pass through (see the sketch below)
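
A minimal sketch of the path-counting idea, assuming a tiny hand-built bipartite graph stored as a dict of item sets; items are ranked by the number of length-3 paths user -> item -> user -> item:

# Hypothetical user-item bipartite graph (each user's edges stored as an item set)
graph = {
    "User1": {"Item A", "Item C", "Item D"},
    "User2": {"Item A", "Item D", "Item E"},
    "User3": {"Item A", "Item C"},
}

def rank_by_paths(user):
    """Score candidate items by counting length-3 paths starting from the user."""
    scores = {}
    for item in graph[user]:                        # step 1: user -> item
        for other, items in graph.items():          # step 2: item -> another user
            if other != user and item in items:
                for cand in items - graph[user]:    # step 3: other user -> new item
                    scores[cand] = scores.get(cand, 0) + 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(rank_by_paths("User1"))    # [('Item E', 2)]: reachable via Item A and Item D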

Model based on matrix factorization

Principle

  • Based on latent representations of users and items, predict users' preferences for items they have not rated

  • The original large matrix is approximately factorized into the product of two small matrices; the actual recommendation computation no longer uses the large matrix, only the two small factor matrices

  • The user-item rating matrix A is m × n-dimensional, i.e., there are m users and n items. We choose a small number k (k << m, k << n)

  • Two matrices U and V are computed, where U is an m × k matrix and V is an n × k matrix, such that

    \(U_{m \times k} V_{n \times k}^{T} \approx A_{m \times n}\)

    A computation process like this is a matrix factorization

6.3 Methods based on matrix factorization

ALS: Alternating Least Squares

ALS-WR: Alternating Least Squares with Weighted-λ-Regularization

  • The ALS matrix factorization algorithm is often used in recommendation systems. It factorizes the user-item rating matrix into a matrix of user preferences for latent item features and a matrix mapping items onto those latent features. The factorization fills in the missing ratings, which means we can recommend items to users based on the filled-in scores.
  • Unlike the traditional SVD approach to factorizing the matrix \(R\) (\(R\in\mathbb{R}^{m\times n}\)), ALS (Alternating Least Squares) looks for two low-dimensional matrices such that \(\tilde{R}=XY\) approximates \(R\), where \(X\in\mathbb{R}^{m\times d}\) and \(Y\in\mathbb{R}^{d\times n}\). This reduces the complexity of the problem from \(O(mn)\) to \(O((m+n)d)\).
  • The process of computing X and Y: first initialize Y with random numbers less than 1 and compute X from the closed-form least-squares formula, which gives an initial X, Y pair. Then, holding X fixed, recompute Y the same way; compute the sum of squared errors; and repeat these two steps until the sum of squared errors falls below a preset threshold or the number of iterations reaches its limit (see the sketch below).
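
To make the alternating updates concrete, here is a toy numpy sketch under simplifying assumptions: the rating matrix is fully observed (the unknown User1/Item E entry from section 5.1 is filled with a placeholder 4 purely so the demo has a dense matrix), and each step uses the closed-form regularized least-squares solution. Production ALS sums only over observed entries:

import numpy as np

def als(R, d=2, reg=0.1, iters=50):
    """Toy ALS: alternately solve for X and Y to minimize ||R - X @ Y.T||^2 + reg."""
    m, n = R.shape
    rng = np.random.default_rng(0)
    Y = rng.random((n, d))    # initialize item factors with random numbers < 1
    I = np.eye(d)
    for _ in range(iters):
        # Fix Y, solve for X: X = R Y (Y^T Y + reg I)^-1
        X = R @ Y @ np.linalg.inv(Y.T @ Y + reg * I)
        # Fix X, solve for Y: Y = R^T X (X^T X + reg I)^-1
        Y = R.T @ X @ np.linalg.inv(X.T @ X + reg * I)
    return X, Y

R = np.array([[5, 3, 4, 4, 4],    # the 4 in the last column is a placeholder
              [3, 1, 2, 3, 3],
              [4, 3, 4, 3, 5],
              [3, 3, 1, 5, 4],
              [1, 5, 5, 2, 1]], dtype=float)
X, Y = als(R)
print(np.round(X @ Y.T, 1))    # the rank-2 reconstruction approximates R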

SVD: singular value decomposition

Topics: Machine Learning