Andrew Ng's machine learning assignment -- anomaly detection and recommendation system

Posted by m00nz00mer on Sun, 26 Dec 2021 10:16:23 +0100

Anomaly detection

Reference material: https://github.com/fengdu78/Coursera-ML-AndrewNg-Notes
Look at the data first:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from scipy.io import loadmat
import math
data = loadmat('data/ex8data1.mat')  # Xval,yval,X
X = data['X']
Xval = data['Xval']
yval = data['yval']
plt.scatter(X[:,0], X[:,1])
plt.show()


Several points lie obviously away from the main cluster and should be the anomalous data points. Let's have the machine learn to detect them.
First, fit a Gaussian to each of the two features:

def estimate_gaussian(X):
    mu = X.mean(axis=0)  # X.mean() averages over all entries; X.mean(axis=0) averages each column (each feature)
    sigma = X.var(axis=0)  # variance
    return mu, sigma


mu, sigma = estimate_gaussian(X)  # mean (mu) and variance (sigma) of each feature of X


p = np.zeros((X.shape[0], X.shape[1]))  # X.shape = (307, 2)
# stats.norm takes the standard deviation (not the variance) as its second argument
p0 = stats.norm(mu[0], np.sqrt(sigma[0]))
p1 = stats.norm(mu[1], np.sqrt(sigma[1]))
p[:,0] = p0.pdf(X[:,0])
p[:,1] = p1.pdf(X[:,1])

stats.norm builds a normal distribution from the given mean and standard deviation (hence np.sqrt(sigma) above, since sigma holds the variances), and p0.pdf(x) evaluates that density at x.
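For reference, the density being evaluated here (and re-implemented by hand below) is the standard univariate Gaussian with mean μ and variance σ²:

$$p(x;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$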
For the sake of rigor, I also wrote the Gaussian density by hand:

def gaussian(x, mu=mu[0], sigma=sigma[0]):  # here sigma is the variance of the feature
    s = 1 / math.sqrt(2 * math.pi * sigma)
    ss = np.exp(-np.power(x - mu, 2) / (2 * sigma))
    return s * ss

The results are, for all practical purposes, identical. (Note: since there are two features there are two Gaussian functions; only the first one is compared here.)
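A quick check (optional, just to back up that claim) comparing the hand-written density with scipy's for the first feature:

# The two implementations should agree up to floating-point error.
print(np.allclose(gaussian(X[:, 0]), p0.pdf(X[:, 0])))  # expected output: True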


The best threshold is then chosen on the validation set by maximizing the F1 score, using the function below (for the precision/recall trade-off, see Andrew Ng's lecture notes, v5.51, 166~168).
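For reference, with tp, fp and fn counted as in the code below, precision, recall and F1 are:

$$P=\frac{tp}{tp+fp},\qquad R=\frac{tp}{tp+fn},\qquad F_1=\frac{2PR}{P+R}$$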

def select_threshold(pval, yval):
    best_epsilon = 0
    best_f1 = 0

    step = (pval.max() - pval.min()) / 1000

    for epsilon in np.arange(pval.min(), pval.max(), step):
        preds = pval < epsilon
        # np.logical_and: element-wise logical AND
        tp = np.sum(np.logical_and(preds == 1, yval == 1)).astype(float)
        fp = np.sum(np.logical_and(preds == 1, yval == 0)).astype(float)
        fn = np.sum(np.logical_and(preds == 0, yval == 1)).astype(float)

        if tp == 0:  # no true positives: F1 is zero or undefined, so skip to avoid division by zero
            continue

        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = (2 * precision * recall) / (precision + recall)

        if f1 > best_f1:
            best_f1 = f1
            best_epsilon = epsilon

    return best_epsilon, best_f1

Then it is easy to flag the anomalous data: evaluate X under the two fitted Gaussians to get per-feature probabilities, find the points whose probability falls below the threshold, and mark them:

p = np.zeros((X.shape[0], X.shape[1]))  # X.shape = (307, 2)
p0 = stats.norm(mu[0], np.sqrt(sigma[0]))
p1 = stats.norm(mu[1], np.sqrt(sigma[1]))
p[:,0] = p0.pdf(X[:,0])
p[:,1] = p1.pdf(X[:,1])


pval = np.zeros((Xval.shape[0], Xval.shape[1]))  # Xval.shape = (307, 2)
pval[:,0] = stats.norm(mu[0], np.sqrt(sigma[0])).pdf(Xval[:,0])  # column 0: first validation feature under the Gaussian fitted to the first feature of X
pval[:,1] = stats.norm(mu[1], np.sqrt(sigma[1])).pdf(Xval[:,1])
epsilon, f1 = select_threshold(pval, yval)  # find the best threshold on pval, then use epsilon to flag anomalies in p
outliers = np.where(p < epsilon)
plt.scatter(X[:,0], X[:,1])
plt.scatter(X[outliers[0],0], X[outliers[0],1], s=50, color='r', marker='o')
plt.show()


When predicting on new data, save the mean and variance of the two features together with the threshold, rebuild the Gaussian distributions in the new code, and flag any new example whose probability falls below the threshold as an anomaly.
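As a minimal sketch of that idea (the file name anomaly_params.npz and the is_anomaly helper are just illustrative, not part of the assignment), continuing with the variables already defined above:

# Save the fitted parameters: per-feature mean (mu), variance (sigma) and the threshold (epsilon).
np.savez('anomaly_params.npz', mu=mu, sigma=sigma, epsilon=epsilon)

# Later, in new code: reload them and flag a new example,
# mirroring the per-feature treatment used above.
params = np.load('anomaly_params.npz')

def is_anomaly(x_new):
    p_new = np.array([
        stats.norm(params['mu'][k], np.sqrt(params['sigma'][k])).pdf(x_new[k])
        for k in range(len(x_new))
    ])
    return bool((p_new < params['epsilon']).any())

print(is_anomaly(X[0]))  # a typical training point: expected False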

Collaborative filtering

That is, the recommendation system. This part was really difficult for me and took a long time; whoever came up with this algorithm is seriously impressive.
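For reference, the objective being minimized is the regularized collaborative filtering cost from the course, where $x^{(i)}$ is the feature vector of movie $i$, $\theta^{(j)}$ the parameter vector of user $j$, and the sums run only over pairs with $r(i,j)=1$:

$$J=\frac{1}{2}\sum_{(i,j):\,r(i,j)=1}\left((\theta^{(j)})^{T}x^{(i)}-y^{(i,j)}\right)^{2}+\frac{\lambda}{2}\sum_{i,k}\left(x_{k}^{(i)}\right)^{2}+\frac{\lambda}{2}\sum_{j,k}\left(\theta_{k}^{(j)}\right)^{2}$$

with gradients

$$\frac{\partial J}{\partial x_{k}^{(i)}}=\sum_{j:\,r(i,j)=1}\left((\theta^{(j)})^{T}x^{(i)}-y^{(i,j)}\right)\theta_{k}^{(j)}+\lambda x_{k}^{(i)},\qquad \frac{\partial J}{\partial\theta_{k}^{(j)}}=\sum_{i:\,r(i,j)=1}\left((\theta^{(j)})^{T}x^{(i)}-y^{(i,j)}\right)x_{k}^{(i)}+\lambda\theta_{k}^{(j)}$$

These correspond to the regularized_cost and regularized_gradient functions in the code.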

The code is as follows:

import numpy as np
import scipy.optimize as opt
from scipy.io import loadmat


def serialize(X, theta):  # flatten X and theta and concatenate them into a single 1-D parameter vector
    return np.concatenate((X.ravel(), theta.ravel()))


def deserialize(param, n_movie, n_user, n_features):  # split the flat parameter vector back into the X and theta matrices
    """into ndarray of X(1682, 10), theta(943, 10)"""
    return param[:n_movie * n_features].reshape(n_movie, n_features), \
           param[n_movie * n_features:].reshape(n_user, n_features)


def cost(param, Y, R, n_features):
    """compute cost for every r(i, j)=1
    Args:
        param: serialized X, theta
        Y (movie, user), (1682, 943): (movie, user) rating
        R (movie, user), (1682, 943): (movie, user) has rating
    """
    # theta (user, feature), (943, 10): user preference
    # X (movie, feature), (1682, 10): movie features
    n_movie, n_user = Y.shape
    X, theta = deserialize(param, n_movie, n_user, n_features)

    inner = np.multiply(X @ theta.T - Y, R)

    return np.power(inner, 2).sum() / 2


def gradient(param, Y, R, n_features):
    # theta (user, feature), (943, 10): user preference
    # X (movie, feature), (1682, 10): movie features
    n_movies, n_user = Y.shape
    X, theta = deserialize(param, n_movies, n_user, n_features)

    inner = np.multiply(X @ theta.T - Y, R)  # (1682, 943)

    # X_grad (1682, 10)
    X_grad = inner @ theta

    # theta_grad (943, 10)
    theta_grad = inner.T @ X

    # roll them together and return
    return serialize(X_grad, theta_grad)


def regularized_cost(param, Y, R, n_features, l=1):
    reg_term = np.power(param, 2).sum() * (l / 2)

    return cost(param, Y, R, n_features) + reg_term


def regularized_gradient(param, Y, R, n_features, l=1):
    grad = gradient(param, Y, R, n_features)
    reg_term = l * param

    return grad + reg_term


data = loadmat('data/ex8_movies.mat')
# params_data = loadmat('data/ex8_movieParams.mat')
Y = data['Y']  # star ratings, 0 where not rated; shape = (1682, 943): 1682 movies, 943 users
R = data['R']  # R[i, j] = 1 if movie i was rated by user j, else 0; shape = (1682, 943)

# X = params_data['X']          # movie parameters, shape = (1682, 10)
# theta = params_data['Theta']  # user parameters, shape = (943, 10)
movie_list = []
with open('data/movie_ids.txt', encoding='latin-1') as f:  # the file is not UTF-8 encoded; latin-1 avoids a decode error
    for line in f:
        tokens = line.strip().split(' ')  # strip whitespace at both ends, then split on spaces
        movie_list.append(' '.join(tokens[1:]))  # drop the leading index, keep only the movie title
movie_list = np.array(movie_list)


# Suppose I rated some of the 1682 films
ratings = np.zeros(1682)

ratings[0] = 4
ratings[6] = 3
ratings[11] = 5
ratings[53] = 4
ratings[63] = 5
ratings[65] = 3
ratings[68] = 5
ratings[97] = 2
ratings[182] = 4
ratings[225] = 5
ratings[354] = 5

# Insert my ratings as column 0 of Y (the star values) and of R (1 where I rated, 0 otherwise),
# so that I become user 0
Y = np.insert(Y, 0, ratings, axis=1)  # now I become user 0
R = np.insert(R, 0, ratings != 0, axis=1)
movies, users = Y.shape  # (1682,944)
n_features = 10
X = np.random.random(size=(movies, n_features))
theta = np.random.random(size=(users, n_features))
lam = 1
Ymean = np.zeros((movies, 1))
Ynorm = np.zeros((movies, users))
param = serialize(X, theta)
# Mean normalization
for i in range(movies):  # R.shape = (1682, 944), movies = 1682
    idx = np.where(R[i,:] == 1)[0]  # indices of the users who rated movie i
    Ymean[i] = Y[i,idx].mean()  # mean rating of movie i over those users
    Ynorm[i,idx] = Y[i,idx] - Ymean[i]  # subtract that mean from the rated entries

res = opt.minimize(fun=regularized_cost,
                   x0=param,
                   args=(Ynorm, R, n_features,lam),
                   method='TNC',
                   jac=regularized_gradient)
X = res.x[:movies * n_features].reshape(movies, n_features)
theta = res.x[movies * n_features:].reshape(users, n_features)
predictions = X @ theta.T
my_preds = predictions[:, 0] + Ymean.flatten()  # add the per-movie mean back for user 0
sorted_preds = np.sort(my_preds)[::-1]  # predicted ratings in descending order
idx = np.argsort(my_preds)[::-1]  # movie indices sorted by predicted rating, highest first

print("Top 10 movie predictions:")
for i in range(10):
    j = int(idx[i])
    print('Predicted rating of {} for movie {}.'.format(float(my_preds[j]), movie_list[j]))

The results of different runs won't be exactly the same: the point is to model what you might like based on the ratings you gave, and with random initialization the output varies from run to run. Overall, though, it works; for example, Return of the Jedi (1983), Star Wars (1977) and similar films usually land in the top 10, often in the top 5.
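If you want runs to be repeatable (just an aside, not part of the original exercise), you can pin NumPy's random seed before the random initialization of X and theta:

np.random.seed(0)  # any fixed seed makes the initialization, and hence the top-10 list, reproducible
X = np.random.random(size=(movies, n_features))
theta = np.random.random(size=(users, n_features))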

Of course, you can also set n_features = 50, which may be more accurate, but the amount of computation increases sharply. Timed roughly, the run took more than 13 minutes in PyCharm with 4 GB of RAM, which is quite striking.
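For rough timing, one option is to wrap the optimizer call with time.perf_counter (a sketch, not part of the original code):

import time

start = time.perf_counter()
res = opt.minimize(fun=regularized_cost, x0=param,
                   args=(Ynorm, R, n_features, lam),
                   method='TNC', jac=regularized_gradient)
print('Optimization took {:.1f} seconds'.format(time.perf_counter() - start))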

Topics: Python Machine Learning numpy