Ex6 Machine Learning: Wu Enda (Andrew Ng) Course Assignment (Python): SVM Support Vector Machines

Posted by Fatboy on Sun, 30 Jan 2022 00:23:31 +0100

Instructions:

This article contains my learning notes for Wu Enda's (Andrew Ng's) machine learning course on Coursera.

  • The first part introduces the knowledge review and key notes for the corresponding week of the course, as well as the libraries used in the code implementation.
  • The second part covers the implementation details of the user-defined functions used in the code.
  • The third part is the concrete code implementation of the course exercise.

0. Pre-condition

This section includes some introductions of libraries.

# This file includes self-created functions used in exercise 6
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import nltk
from scipy.io import loadmat
from sklearn import svm

00. Self-created Functions

This section includes self-created functions.

  • loadData(path): read data from a .mat file

    # Load data from the given file
    # Args: {path: data path}
    def loadData(path):
        data = loadmat(path)
        return data['X'], data['y']
    
  • plotData(X, y): visualize data

    # Visualize data
    # Args: {X: training set; y: label set}
    def plotData(X, y):
        plt.figure(figsize=[8, 6])
        plt.scatter(X[:, 0], X[:, 1], c=y.flatten())
        plt.xlabel('X1')
        plt.ylabel('X2')
        plt.title('Data Visualization')
        # plt.show()
    
    
  • plotBoundary(classifier, X): draw the decision boundary between classes

    # Plot the boundary between two classes
    # Args: {classifier: classifier; X: training set}
    def plotBoundary(classifier, X):
        x_min, x_max = X[:, 0].min() * 1.2, X[:, 0].max() * 1.1
        y_min, y_max = X[:, 1].min() * 1.2, X[:, 1].max() * 1.1
        xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                             np.linspace(y_min, y_max, 500))
        # Use the given classifier to predict a class for every point on the grid
        Z = classifier.predict(np.c_[xx.flatten(), yy.flatten()])
        Z = Z.reshape(xx.shape)
        plt.contour(xx, yy, Z)
    
  • displayBoundaries(X, y): draw the decision boundary (linear kernel) for different values of the SVM parameter C

    # Change the SVM parameter C (1 and 100) and draw the decision boundary in each case
    # Args: {X: training set; y: label set}
    def displayBoundaries(X, y):
        # Here, scikit-learn's svm package is used to obtain multiple SVM models with a linear kernel
        models = [svm.SVC(C=C, kernel='linear') for C in [1, 100]]
        # Given the training set X and label set y, multiple SVM models are trained to obtain multiple classifiers
        classifiers = [model.fit(X, y.flatten()) for model in models]
        # Output information
        titles = ['SVM Decision Boundary with C = {}'.format(C) for C in [1, 100]]
        # For each classifier, draw its decision boundary
        for classifier, title in zip(classifiers, titles):
            plotData(X, y)
            plotBoundary(classifier, X)
            plt.title(title)
        # Display data
        plt.show()
    
  • gaussianKernel(x1, x2, sigma): implement the Gaussian kernel function (a consistency check against scikit-learn appears after this list)

    # Implement a Gaussian kernel function
    # (Can be regarded as a similarity function that measures the distance between a pair of samples)
    # Args: {x1: sample 1; x2: sample 2; sigma: Gaussian kernel parameter}
    def gaussianKernel(x1, x2, sigma):
        return np.exp(-(np.power(x1 - x2, 2).sum() / (2 * np.power(sigma, 2))))
    
  • displayGaussKernelBoundary(X, y, C, sigma): draw the decision boundary of Gaussian kernel SVM for a data set

    # Display the decision boundary of an SVM with a Gaussian kernel for a data set
    # Args: {X: training set; y: label set; C: SVM parameter; sigma: Gaussian kernel parameter}
    def displayGaussKernelBoundary(X, y, C, sigma):
        # scikit-learn's 'rbf' kernel is parameterized by gamma = 1 / (2 * sigma^2)
        gamma = np.power(sigma, -2.) / 2
        # 'rbf' refers to the radial basis function / Gaussian kernel; use the given C rather than a hard-coded value
        model = svm.SVC(C=C, kernel='rbf', gamma=gamma)
        classifier = model.fit(X, y.flatten())
        plotData(X, y)
        plotBoundary(classifier, X)
        plt.title('Decision boundary using SVM with a Gaussian Kernel')
        plt.show()
    
  • trainGaussParams(X, y, Xval, yval): select the optimal parameters C and sigma by comparing errors on the cross-validation set (an equivalent scikit-learn grid search is sketched in section 1.2.3)

    # Train out the best parameters 'C' and 'sigma' with the least cost on the validation set
    # Args: {X: training set; y: label set; Xval: cross-validation set; yval: cross-validation label set}
    def trainGaussParams(X, y, Xval, yval):
        C_values = (0.01, 0.03, 0.1, 0.3, 1., 3., 10., 30.)
        sigma_values = C_values
        best_pair, best_score = (0, 0), 0
        for C in C_values:
            for sigma in sigma_values:
                gamma = np.power(sigma, -2.) / 2
                model = svm.SVC(C=C, kernel='rbf', gamma=gamma)
                classifier = model.fit(X, y.flatten())
                this_score = model.score(Xval, yval)
                if this_score > best_score:
                    best_score = this_score
                    best_pair = (C, sigma)
        print('Best pair(C, sigma): {}, best score: {}'.format(best_pair, best_score))
        return best_pair[0], best_pair[1]
    
  • preprocessEmail(email): preprocess an email (a small before/after demo appears in section 2.1)

    # Preprocess an email
    # Performs all processing except word stemming and removal of non-words
    def preprocessEmail(email):
        # Full text lowercase
        email = email.lower()
        # Strip HTML. Match '<', then one or more characters that are not '<' or '>', then '>', i.e. anything of the form <...>
        email = re.sub('<[^<>]+>', ' ', email)
        # Normalize URLs. Convert all URLs to "httpaddr".
        email = re.sub(r'(http|https)://[^\s]*', 'httpaddr', email)
        # Normalize email addresses. Convert all email addresses to "emailaddr".
        email = re.sub(r'[^\s]+@[^\s]+', 'emailaddr', email)
        # Normalize dollar signs.
        email = re.sub(r'[\$]+', 'dollar', email)
        # Normalize numbers.
        email = re.sub(r'[\d]+', 'number', email)
        return email
    
  • email2TokenList(email): stem words, remove non-word content, and return the list of processed tokens

    # Conduct word stemming and removal of non-words, returning the processed tokens one by one
    # NLTK's Porter stemmer is used here, since it is more accurate and efficient
    def email2TokenList(email):
        # Preprocess the email
        email = preprocessEmail(email)
        # Instantiate the stemmer
        stemmer = nltk.stem.porter.PorterStemmer()
        # Split the whole email into separated words
        tokens = re.split(r'[ \@\$\/\#\.\-\:\&\*\+\=\[\]\?\!\(\)\{\}\,\'\"\>\_\<\;\%]', email)
        # Traverse all the split tokens
        token_list = []
        for token in tokens:
            # Remove non-word content: delete any non-alphanumeric characters
            token = re.sub('[^a-zA-Z0-9]', '', token)
            # Skip empty strings, which contain no characters at all
            if not len(token): continue
            # Stem the word
            stemmed_word = stemmer.stem(token)
            # Append the stemmed word to the list
            token_list.append(stemmed_word)
        return token_list
    
  • email2VocabularyList(email, vocab_list): get the indices of words appearing in both the email and the vocabulary

    # Get the indices of words that exist both in the email and in the vocabulary list
    # Args: {email: email content; vocab_list: vocabulary list}
    def email2VocabularyList(email, vocab_list):
        tokens = email2TokenList(email)
        index = [i for i in range(len(vocab_list)) if vocab_list[i] in tokens]
        return index
    
  • email2FeatureVector(email): extract features of email

    # Extract features from an email and turn it into a feature vector
    # (The length equals that of the vocabulary list; the value at an index is 1 if the corresponding word occurs in the email, otherwise 0)
    # Args: {email: email content}
    def email2FeatureVector(email):
        # Vocabulary list provided with the assignment
        df = pd.read_table('../data/vocab.txt', names=['words'])
        vocab_list = df['words'].tolist()
        # The length is the same as that of the word list
        feature_vector = np.zeros(len(vocab_list))
        # If the word exists in the email, the corresponding subscript position value is 1, otherwise it is 0
        vocab_indices = email2VocabularyList(email, vocab_list)
        for i in vocab_indices:
            feature_vector[i] = 1
        return feature_vector
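
As the quick consistency check promised above, gaussianKernel can be compared against scikit-learn's rbf_kernel. This is a sketch only: rbf_kernel expects 2-D arrays and uses gamma = 1 / (2 * sigma^2).

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Sketch: verify our gaussianKernel against scikit-learn's rbf_kernel
x1, x2, sigma = np.array([1, 2, 1]), np.array([0, 4, -1]), 2.
ours = gaussianKernel(x1, x2, sigma)
theirs = rbf_kernel(x1.reshape(1, -1), x2.reshape(1, -1), gamma=1 / (2 * sigma ** 2))[0, 0]
print(ours, theirs)  # both should print 0.32465246735834974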
    

1. Support Vector Machines

In the first half of this exercise, you will be using support vector machines (SVMs) with various example 2D datasets.

Experimenting with these datasets will help you gain an intuition of how SVMs work and how to use a Gaussian kernel with SVMs.

In the second half of the exercise, you will be using support vector machines to build a spam classifier.

  • The relevant functions are described in detail in the "Self-created Functions" section at the head of the article.
# 1. Support Vector Machines
path = '../data/ex6data1.mat'
X, y = func.loadData(path)

1.1 Example dataset 1

# 1.1 Example dataset 1
# Visualize the data
func.plotData(X, y)

# Try different parameters C and draw the decision boundary in various cases
func.displayBoundaries(X, y)
  • Data visualization:

  • Decision boundary (linear kernel, C = 1):

  • Decision boundary (linear kernel, C = 100):

As can be seen from the above figure:

  • When $C$ is larger (i.e., $1/\lambda$ is larger and $\lambda$ is smaller), the model penalizes misclassification more heavily and is stricter: fewer misclassifications and a smaller margin.
  • When $C$ is smaller (i.e., $1/\lambda$ is smaller and $\lambda$ is larger), the penalty for misclassification is reduced and the model is looser: some misclassification is allowed in exchange for a larger margin.

1.2 SVM with Gaussian Kernels

In order to find a nonlinear decision boundary with an SVM, we first need to implement the Gaussian kernel function. You can think of the Gaussian kernel as a similarity function used to measure the distance between a pair of samples $(x^{(i)}, x^{(j)})$.
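
Concretely, the kernel implemented by gaussianKernel above is

$$K_{\text{gaussian}}\left(x^{(i)}, x^{(j)}\right) = \exp\left(-\frac{\left\lVert x^{(i)} - x^{(j)} \right\rVert^2}{2\sigma^2}\right)$$

and scikit-learn's 'rbf' kernel is parameterized by $\gamma = \frac{1}{2\sigma^2}$, which is why the helper functions compute gamma from sigma that way.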

Note that most SVM libraries automatically add the extra feature $x_0$ and the intercept term $\theta_0$ for you, so there is no need to add them manually.

# 1.2 SVM with Gaussian Kernels
path2 = '../data/ex6data2.mat'
X2, y2 = func.loadData(path2)

path3 = '../data/ex6data3.mat'
df3 = loadmat(path3)
X3, y3, Xval, yval = df3['X'], df3['y'], df3['Xval'], df3['yval']

1.2.1 Gaussian Kernel

# 1.2.1 Gaussian Kernel
res_gaussianKernel = func.gaussianKernel(np.array([1, 2, 1]), np.array([0, 4, -1]), 2.)
print(res_gaussianKernel)  # 0.32465246735834974
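
To see where this number comes from: with $x_1 = (1, 2, 1)$, $x_2 = (0, 4, -1)$ and $\sigma = 2$,

$$\lVert x_1 - x_2 \rVert^2 = 1^2 + (-2)^2 + 2^2 = 9, \qquad \exp\left(-\frac{9}{2 \cdot 2^2}\right) = \exp(-1.125) \approx 0.3247$$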

1.2.2 Example dataset 2

# 1.2.2 Example dataset 2
# Visualize the data
func.plotData(X2, y2)

# Draw the decision boundary of SVM based on Gaussian kernel function for data set
func.displayGaussKernelBoundary(X2, y2, C=1, sigma=0.1)
  • Data visualization:

  • Decision boundary (Gaussian kernel):

1.2.3 Example dataset 3

# 1.2.3 Example dataset 3
# Visualize the data
func.plotData(X3, y3)

# Training parameters C and sigma of SVM based on Gaussian kernel function
final_C, final_sigma = func.trainGaussParams(X3, y3, Xval, yval)

# Draw the decision boundary of SVM based on Gaussian kernel function for data set
func.displayGaussKernelBoundary(X3, y3, C=final_C, sigma=final_sigma)
  • Data visualization:

  • Decision boundary (Gaussian kernel, optimal parameter):
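
For comparison, the same parameter search can be expressed with scikit-learn's built-in tools. A minimal sketch, assuming the same grid as trainGaussParams and treating (Xval, yval) as a predefined validation split (PredefinedSplit marks rows used only for training with -1 and validation rows with 0):

import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Merge training and validation data; GridSearchCV will only score on the validation rows
X_all = np.concatenate([X3, Xval])
y_all = np.concatenate([y3, yval]).flatten()
test_fold = np.concatenate([-np.ones(len(X3)), np.zeros(len(Xval))])

values = (0.01, 0.03, 0.1, 0.3, 1., 3., 10., 30.)
param_grid = {'C': values, 'gamma': [1 / (2 * s ** 2) for s in values]}
search = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=PredefinedSplit(test_fold))
search.fit(X_all, y_all)
print(search.best_params_, search.best_score_)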


2. Spam Classification

This section includes some details of exploring "Spam Classification" using SVM.

  • The relevant functions are described in detail in the "Self-created Functions" section at the head of the article.

In this part, we will use an SVM to build a spam classifier. You need to turn each email into an $n$-dimensional feature vector $x$; the classifier will then judge whether a given email is spam ($y = 1$) or not spam ($y = 0$).

# 2. Spam Classifier
# Get mail content
with open('../data/emailSample1.txt', 'r') as f:
    email = f.read()
    print(email)

2.1 Preprocess Emails

You can see that the email content includes URLs, email addresses, numbers, and dollar signs. Many emails contain these elements, but the specific content differs from email to email. Therefore, a common method when processing emails is to normalize the data: treat all URLs as the same, all numbers as the same, and so on.

For example, we replace all URLs with a unique string 'httpaddr' to indicate that the email contains a URL without requiring specific URL content. This usually improves the performance of the spam classifier because spammers usually randomize URLs, so the chance of seeing any specific URL again in new spam is very small.

We can do the following:

  1. Lower-casing: convert the whole email to lowercase.
  2. Stripping HTML: remove all HTML tags and keep only the content.
  3. Normalizing URLs: replace all URLs with the string "httpaddr".
  4. Normalizing Email Addresses: replace all email addresses with "emailaddr".
  5. Normalizing Dollars: replace all dollar signs ($) with "dollar".
  6. Normalizing Numbers: replace all numbers with "number".
  7. Word stemming: reduce all words to their stems. For example, "discount", "discounts", "discounted" and "discounting" are all replaced by "discount".
  8. Removal of non-words: remove all non-alphanumeric content and collapse all whitespace (tabs, newlines, spaces) to a single space.

See the user-defined function section at the head of the article for the specific code.
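
As the small before/after demo promised earlier, here is what preprocessEmail does to a made-up snippet (the input string below is hypothetical, not taken from the sample email):

# Hypothetical demo of preprocessEmail; the input string is made up for illustration
sample = 'Visit http://example.com NOW and win $100!'
print(func.preprocessEmail(sample))
# -> 'visit httpaddr now and win dollarnumber!'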

# 2.1 Preprocess emails
# 2.1.1 Vocabulary List
# 2.2 Extract Features from emails
feature_vector = func.email2FeatureVector(email)
print('Length of feature vector = {}\nNumber of occurred words = {}'
      .format(len(feature_vector), int(feature_vector.sum())))

Output result:

    Length of feature vector = 1899
    Number of occurred words = 45

2.1.1 Vocabulary List

After preprocessing the mail, we get the processed word list. Next, we choose the words we use in the classifier and which words we need to remove.

The assignment provides a vocabulary list, vocab.txt, which contains 1899 words commonly used in practice.

We need to find out which of the words in the processed email appear in vocab.txt, and return their indices in vocab.txt, i.e., the indices of the words we want to use for training.
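
A minimal usage sketch (assuming vocab.txt is read the same way as in email2FeatureVector):

import pandas as pd

# Read the provided vocabulary list and look up which of its words occur in the sample email
df = pd.read_table('../data/vocab.txt', names=['words'])
vocab_list = df['words'].tolist()
vocab_indices = func.email2VocabularyList(email, vocab_list)
print(vocab_indices[:10])  # indices of the first few matched words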

2.2 Extract Features from Emails

Extract the features of the email and obtain a feature vector representing it (the length equals the length of the vocabulary list; the value at the corresponding index is $1$ if the word occurs in the email, otherwise $0$).

2.3 Train SVM for Spam Classification

Read the extracted feature vector and the corresponding label. It is divided into training set and test set.

# 2.3 Train SVM for Spam Classification
# Read the extracted feature vector and the corresponding label. It is divided into training set and test set.
path_train = '../data/spamTrain.mat'
path_test = '../data/spamTest.mat'
mat_train = loadmat(path_train)
mat_test = loadmat(path_test)
X_train, y_train = mat_train['X'], mat_train['y']
X_test, y_test = mat_test['Xtest'], mat_test['ytest']
# Fit / train the model
model = svm.SVC(C=0.1, kernel='linear')
model.fit(X_train, y_train.flatten())

2.4 Top Predictors for Spam

# 2.4 Top Predictors for Spam
prediction_train = model.score(X_train, y_train)
prediction_test = model.score(X_test, y_test)
print('Predictions of training: ', prediction_train)
print('Predictions of testing: ', prediction_test)

Output result:

    Predictions of training: 0.99825
    Predictions of testing: 0.989
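
The original exercise also inspects which vocabulary words carry the largest weights in the trained linear model, i.e. the strongest spam indicators. A minimal sketch, assuming the linear-kernel model trained above and the vocab.txt layout from section 2.1.1:

import numpy as np
import pandas as pd

# Words with the largest positive weights in the linear SVM are the strongest spam indicators
df = pd.read_table('../data/vocab.txt', names=['words'])
vocab_list = df['words'].tolist()
weights = model.coef_[0]                  # one weight per vocabulary word (linear kernel only)
top15 = np.argsort(weights)[::-1][:15]    # indices of the 15 largest weights
print([vocab_list[i] for i in top15])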


Topics: Python, Machine Learning, Data Analysis