Machine Learning Introductory Notes and Code Implementation

Posted by SteveFrost on Tue, 01 Feb 2022 00:44:13 +0100

I. Machine Learning Overview

These notes were written to consolidate what I learned from the Dark Horse Programmer machine learning introduction video on Bilibili (station B).
Bilibili video link
Related materials can be found in the video's comment section.

1.1 Definition

Machine learning automatically analyzes data to build a model, and then uses that model to make predictions on unknown data.

1.2 Composition of a machine learning dataset

Structure: feature values + target value

For example, the data structure shown in the figure below

Each row of data is called a **sample**.
Some datasets have no target values.

1.3 Classification of machine learning algorithms

	Machine learning algorithms are mainly divided into two categories: supervised learning and unsupervised learning.

Supervised learning (prediction)

Definition: the input data consists of feature values and target values. The output of the learned function can be
a continuous value (called regression) or a finite number of discrete values (called classification).

Classification algorithms:

 k-nearest neighbors, Bayesian classification, decision trees and random forests, logistic regression, neural networks

Regression algorithms:

 Linear regression, ridge regression

Unsupervised learning

Definition: the input data consists of feature values only.

Clustering: k-means (see the sketch below)
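As a minimal sketch of the unsupervised side (the dataset and the cluster count are illustrative choices, not from the original notes), k-means can be run on the iris features while ignoring the target values:

from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

# Unsupervised learning only sees the feature values, so the target is not used
iris = load_iris()

# Illustrative choice: ask k-means for 3 clusters
estimator = KMeans(n_clusters=3, random_state=22)
labels = estimator.fit_predict(iris.data)
print(labels[:10])  # cluster index assigned to the first 10 samples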

1.4 Machine learning development process:

II. Feature Engineering

Feature engineering here is carried out with the sklearn library.

2.1 Obtaining datasets

The following are three dataset websites:

Kaggle website: https://www.kaggle.com/datasets

UCI dataset website: http://archive.ics.uci.edu/ml/

Scikit-learn website:
http://scikit-learn.org/stable/datasets/index.html#datasets

Alternatively, data may come from within the company.

2.2 Introduction to the sklearn library

2.2.1 Installation
pip3 install Scikit-learn==0.19.1
Note: numpy, scipy and other dependencies must be installed first.

Getting started with machine learning in sklearn generally uses two classic datasets: the iris dataset and the Boston house price dataset.

Iris dataset:

Boston house price dataset:

2.2.2 API for reading datasets

sklearn.datasets
	# Load and fetch popular datasets
datasets.load_*()
	# Obtain a small dataset; the data is bundled inside the datasets module
datasets.fetch_*(data_home=None)
	# Obtain a large dataset that must be downloaded from the internet.
	# data_home is the directory the dataset is downloaded to; the default is ~/scikit_learn_data/

Small dataset:

sklearn.datasets.load_iris()
#Load and return iris dataset
sklearn.datasets.load_boston()
#Load and return Boston house price dataset

Large dataset:

sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train')
# subset: 'train', 'test' or 'all' (optional). Selects which part of the dataset to load:
# 'train' for the training set, 'test' for the test set, 'all' for both

2.2.3 Use of the dataset API (iris as an example)

Return value:

Both load and fetch return a datasets.base.Bunch (a dictionary-like object)
· data: feature data array, a 2-D numpy.ndarray of shape [n_samples, n_features]
· target: label array, a 1-D numpy.ndarray of length n_samples
· DESCR: description of the data
· feature_names: feature names (not available for the news data, handwritten digits and regression datasets)
· target_names: label names
from sklearn.datasets import load_iris
# Get iris dataset
iris = load_iris()
print("Return value of iris dataset:\n", iris)
# The return value is a Bunch, which inherits from dict
print("Eigenvalues of iris:\n", iris["data"])
print("Target value of iris:\n", iris.target)
print("Iris characteristic Name:\n", iris.feature_names)
print("Name of iris target value:\n", iris.target_names)
print("Description of iris:\n", iris.DESCR)

2.2.4 Dataset division
A machine learning dataset is generally divided into two parts:

·Training data: used to train and build the model
·Test data: used during model validation to evaluate whether the model is effective
 Typical split proportions:
·Training set: 70%, 80%, 75%
·Test set: 30%, 20%, 25%

Splitting API:

sklearn.model_selection.train_test_split(arrays, *options)
	x: feature values of the dataset
	y: label values of the dataset
	test_size: size of the test set, generally a float
	random_state: random seed. Different seeds produce different random splits; the same seed produces the same split.
	return: training set features, test set features, training set labels, test set labels (shuffled by default)
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split


def datasets_demo():
    """
    Demonstration of iris dataset
    :return: None
    """
    # 1. Get iris dataset
    iris = load_iris()
    print("Return value of iris dataset:\n", iris)
    # The return value is a Bunch, which inherits from dict
    print("Eigenvalues of iris:\n", iris["data"])
    print("Target value of iris:\n", iris.target)
    print("Iris characteristic Name:\n", iris.feature_names)
    print("Name of iris target value:\n", iris.target_names)
    print("Description of iris:\n", iris.DESCR)

    # 2. Split the iris dataset
    # training set features x_train, test set features x_test, training set targets y_train, test set targets y_test
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)
    print("x_train:\n", x_train.shape)
    # Random number seed
    x_train1, x_test1, y_train1, y_test1 = train_test_split(iris.data, iris.target, random_state=6)
    x_train2, x_test2, y_train2, y_test2 = train_test_split(iris.data, iris.target, random_state=6)
    print("If the random number seeds are inconsistent:\n", x_train == x_train1)
    print("If the random number seeds are consistent:\n", x_train1 == x_train2)

    return None

2.3 Introduction to feature engineering

	Feature engineering uses professional background knowledge and skills to process data so that the features work better with machine learning algorithms.


pandas: a very convenient tool for reading data and doing basic processing.
sklearn: provides a powerful interface for feature processing.
Contents covered:
Feature extraction
Feature preprocessing
Feature dimensionality reduction

2.4 Feature extraction

2.4.1 Why feature extraction
Raw data is often high-dimensional and contains a lot of redundant information, or is very sparse, or is expensive to compute on. Training on the raw data is feasible but usually inefficient, so feature extraction is often necessary.

Note: 1. Feature extraction mainly addresses three problems:
(1) redundant information caused by strong correlations among the original features; (2) the original data is very sparse; (3) the original data has a huge number of dimensions.
2. Feature extraction helps the computer understand the data better.

2.4.2 Contents:

 Dictionary feature extraction (feature discretization)
 Text feature extraction
 Image feature extraction (deep learning)

2.4.3 Dictionary feature extraction API

sklearn.feature_extraction.DictVectorizer(sparse=True, ...)
	DictVectorizer.fit_transform(X)  X: dictionary or an iterator of dictionaries. Return value: sparse matrix
	DictVectorizer.inverse_transform(X)  X: array or sparse matrix. Return value: data in the format before conversion
	DictVectorizer.get_feature_names()  Returns the category names

Note: if the features contain categorical values, they are generally one-hot encoded
EX:

# 1. Instantiate the class DictVectorizer
# 2. Call the fit_transform method to convert the data (pay attention to the returned data type)
from sklearn.feature_extraction import DictVectorizer

def dict_demo():
    data = [{'city': 'Beijing', 'temperature': 100}, {'city': 'Shanghai', 'temperature': 60}, {'city': 'Shenzhen', 'temperature': 30}]
    # 1. Instantiate a converter class
    transfer = DictVectorizer(sparse=False)  # sparse=False returns a dense ndarray instead of a sparse matrix
    # 2. Call fit_transform
    data = transfer.fit_transform(data)
    print("Returned results:\n", data)
    # Print feature name
    print("Feature Name:\n", transfer.get_feature_names())

    return None

Return result (with sparse=False, a dense array is returned):

Returned results:
[[   0.    1.    0.  100.]
[   1.    0.    0.   60.]
[   0.    0.    1.   30.]]
Feature Name:
['city=Shanghai', 'city=Beijing', 'city=Shenzhen', 'temperature']

If sparse=True (the default), a sparse matrix is returned and printed in coordinate format:

(0, 1)    1.0
(0, 3)    100.0
(1, 0)    1.0
(1, 3)    60.0
(2, 2)    1.0
(2, 3)    30.0
Feature Name:
['city=Shanghai', 'city=Beijing', 'city=Shenzhen', 'temperature']

2.4.4 Text feature extraction API

sklearn.feature_extraction.text.CountVectorizer(stop_words=[])     Returns a word-frequency matrix
	CountVectorizer.fit_transform(X)  X: text or an iterable of text strings. Return value: sparse matrix
	CountVectorizer.inverse_transform(X)  X: array or sparse matrix. Return value: data in the format before conversion
	CountVectorizer.get_feature_names()  Return value: list of words
	sklearn.feature_extraction.text.TfidfVectorizer

EX:

# 1. Instantiate the class CountVectorizer
# 2. Call the fit_transform method to convert the data (pay attention to the return format; use toarray() to convert the sparse matrix into an array)
from sklearn.feature_extraction.text import CountVectorizer

def text_count_demo():
    """
    Text feature extraction with CountVectorizer
    :return: None
    """
    data = ["life is short,i like like python", "life is too long,i dislike python"]
    # 1. Instantiate a converter class (note: CountVectorizer has no sparse parameter)
    transfer = CountVectorizer()
    # 2. Call fit_transform
    data = transfer.fit_transform(data)
    print("Results of text feature extraction:\n", data.toarray())
    print("Return feature Name:\n", transfer.get_feature_names())

    return None
Return result:
Results of text feature extraction:
[[0 1 1 2 0 1 1 0]
[1 1 1 0 1 1 0 1]]
Return feature Name:
['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']

If the text to be processed is Chinese, the jieba library is needed for word segmentation.

jieba.cut()
	Returns a generator of words
Installation:
pip3 install jieba
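A minimal sketch of how jieba.cut is typically combined with CountVectorizer (the sample sentence is only illustrative): the generator is joined with spaces so that each word becomes a separate token.

import jieba

text = "我爱北京天安门"
# jieba.cut returns a generator of words; join them with spaces
segmented = " ".join(jieba.cut(text))
print(segmented)  # e.g. "我 爱 北京 天安门"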

Case:
Extract features from the following text

Today is cruel, tomorrow is more cruel, and the day after tomorrow is beautiful,
But most of them will die tomorrow night, so don't give up today.

The light we see from distant galaxies was emitted millions of years ago,
So when we see the universe, we are looking at its past.

If you only know something in one way, you won't really know it.
The secret of understanding the true meaning of things depends on how to connect them with what we know.

analysis:

1. Segment the sentences with jieba.cut
2. Instantiate CountVectorizer
3. Join the segmented words into strings and pass them to fit_transform
from sklearn.feature_extraction.text import CountVectorizer
import jieba

def cut_word(text):
    """
    Chinese word segmentation
    e.g. "我爱北京天安门" --> "我 爱 北京 天安门"
    :param text:
    :return: text
    """
    # Segment the Chinese string with jieba and join the words with spaces
    text = " ".join(list(jieba.cut(text)))

    return text

def text_chinese_count_demo2():
    """
    Feature extraction of Chinese
    :return: None
    """
    data = ["One kind or another, today is cruel, tomorrow is more cruel, and the day after tomorrow is beautiful, but most of them die tomorrow night, so everyone should not give up today.",
            "The light we see from distant galaxies was emitted millions of years ago, so when we see the universe, we are looking at its past.",
            "If you only know something in one way, you won't really know it. The secret of understanding the true meaning of things depends on how to connect them with what we know."]
    # Convert the original data into the form of good words
    text_list = []
    for sent in data:
        text_list.append(cut_word(sent))
    print(text_list)

    # 1. Instantiate a converter class
    transfer = CountVectorizer()
    # 2. Call fit_transform
    data = transfer.fit_transform(text_list)
    print("Results of text feature extraction:\n", data.toarray())
    print("Return feature Name:\n", transfer.get_feature_names())

    return None
Return results
Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/mz/tzf2l3sx4rgg6qpglfb035_r0000gn/T/jieba.cache
Loading model cost 1.032 seconds.
['One kind or another, today is cruel, tomorrow is more cruel, and the day after tomorrow is beautiful, but most of them die tomorrow night, so everyone should not give up today.', 'The light we see from distant galaxies was emitted millions of years ago, so when we see the universe, we are looking at its past.', "If you only know something in one way, you won't really know it. The secret of understanding the true meaning of things depends on how to connect them with what we know."]
Prefix dict has been built succesfully.
Results of text feature extraction:
 [[2 0 1 0 0 0 2 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 2 0 1 0 2 1 0 0 0 1 1 0 0 1 0]
 [0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 1 3 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 1 0 1]
 [1 1 0 0 4 3 0 0 0 0 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 2 1 0 0 1 0 0 0]]
Return feature Name:
 ['one kind', 'can't', 'No', 'before', 'understand', 'thing', 'today', 'Just in', 'Millions of years', 'issue', 'Depending on', 'only need', 'the day after tomorrow', 'meaning', 'gross', 'how', 'If', 'universe', 'We', 'therefore', 'give up', 'mode', 'tomorrow', 'Galaxy', 'night', 'Some kind', 'cruel', 'each', 'notice', 'real', 'secret', 'absolutely', 'fine', 'contact', 'past times', 'still', 'such']

2.4.5 TF-IDF text feature extraction

The main idea of TF-IDF:
if a word or phrase appears with high probability in one article but rarely in other articles, it is considered to have good discriminating power and to be suitable for classification.

TF-IDF function:
it evaluates how important a word is to a document in a document set or corpus.

Understanding:
Term frequency (tf) is the frequency with which a given word appears in a document.
Inverse document frequency (idf) is a measure of the general importance of a word. The idf of a particular word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the base-10 logarithm of the quotient.

tf-idf formula:
tfidf(w, d) = tf(w, d) × idf(w), where idf(w) = lg(total number of documents / number of documents containing w)
The final result can be understood as the degree of importance.

Note: suppose the total number of words in a document is 100 and the word "very" appears 5 times; then the term frequency of "very" in this document is 5/100 = 0.05. The inverse document frequency (idf) is computed by dividing the total number of documents by the number of documents containing "very". So if "very" appears in 10,000 documents and the total number of documents is 10,000,000, the inverse document frequency is lg(10,000,000 / 10,000) = 3. The tf-idf score of "very" for this document is therefore 0.05 × 3 = 0.15.
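A quick sketch reproducing the arithmetic of the note above (plain Python; lg means the base-10 logarithm):

import math

tf = 5 / 100                              # "very" appears 5 times in a 100-word document
idf = math.log10(10_000_000 / 10_000)     # total documents / documents containing the word
print(tf * idf)                           # 0.05 * 3 = 0.15

The same idea is implemented by TfidfVectorizer below.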
from sklearn.feature_extraction.text import TfidfVectorizer
import jieba

def cut_word(text):
    """
    Chinese word segmentation
    e.g. "我爱北京天安门" --> "我 爱 北京 天安门"
    :param text:
    :return: text
    """
    # Segment the Chinese string with jieba and join the words with spaces
    text = " ".join(list(jieba.cut(text)))

    return text

def text_chinese_tfidf_demo():
    """
    Feature extraction of Chinese
    :return: None
    """
    data = ["One kind or another, today is cruel, tomorrow is more cruel, and the day after tomorrow is beautiful, but most of them die tomorrow night, so everyone should not give up today.",
            "The light we see from distant galaxies was emitted millions of years ago, so when we see the universe, we are looking at its past.",
            "If you only know something in one way, you won't really know it. The secret of understanding the true meaning of things depends on how to connect them with what we know."]
    # Convert the original data into the form of good words
    text_list = []
    for sent in data:
        text_list.append(cut_word(sent))
    print(text_list)

    # 1. Instantiate a converter class
    transfer = TfidfVectorizer(stop_words=['one kind', "can't", 'No'])
    # 2. Call fit_transform
    data = transfer.fit_transform(text_list)
    print("Results of text feature extraction:\n", data.toarray())
    print("Return feature Name:\n", transfer.get_feature_names())

    return None
Return result:
[[ 0.          0.          0.          0.43643578  0.          0.          0.
   0.          0.          0.21821789  0.          0.21821789  0.          0.
   0.          0.          0.21821789  0.21821789  0.          0.43643578
   0.          0.21821789  0.          0.43643578  0.21821789  0.          0.
   0.          0.21821789  0.21821789  0.          0.          0.21821789
   0.        ]
 	[ 0.2410822   0.          0.          0.          0.2410822   0.2410822
   0.2410822   0.          0.          0.          0.          0.          0.
   0.          0.2410822   0.55004769  0.          0.          0.          0.
   0.2410822   0.          0.          0.          0.          0.48216441
   0.          0.          0.          0.          0.          0.2410822
   0.          0.2410822 ]
 	[ 0.          0.644003    0.48300225  0.          0.          0.          0.
   0.16100075  0.16100075  0.          0.16100075  0.          0.16100075
   0.16100075  0.          0.12244522  0.          0.          0.16100075
   0.          0.          0.          0.16100075  0.          0.          0.
   0.3220015   0.16100075  0.          0.          0.16100075  0.          0.
   0.        ]]
	Return feature Name:
 	['before', 'understand', 'thing', 'today', 'Just in', 'Millions of years', 'issue', 'Depending on', 'only need', 'the day after tomorrow', 'meaning', 'gross', 'how', 'If', 'universe', 'We', 'therefore', 'give up', 'mode', 'tomorrow', 'Galaxy', 'night', 'Some kind', 'cruel', 'each', 'notice', 'real', 'secret', 'absolutely', 'fine', 'contact', 'past times', 'still', 'such']

2.5 Feature preprocessing

2.5.1 What is feature preprocessing
The process of converting feature data, through conversion functions, into feature data that is more suitable for the algorithm model.
Understand from the following figure:

Why preprocess features:
when the units or scales of the features differ greatly, or the variance of one feature is several orders of magnitude larger than the others, that feature can easily dominate the target result, so that some algorithms cannot learn the other features.

API:

sklearn.preprocessing

Contents
Making the numeric data dimensionless:
normalization and standardization.

2.5.2 Normalization

Maps the data to a given interval (by default [0, 1]) by transforming the original data.

Formula:

X' = (x - min) / (max - min)
X'' = X' * (mx - mi) + mi

Understanding:

The transformation acts on each column; max is the maximum of the column, min is the minimum of the column, and X'' is the final result.
mx and mi are the bounds of the target interval; by default mx is 1 and mi is 0.


Normalization API

sklearn.preprocessing.MinMaxScaler(feature_range=(0,1), ...)
	MinMaxScaler.fit_transform(X)
	X: data in numpy array format [n_samples, n_features]
	Return value: an array of the same shape after conversion

Case: (dating data)

# Part of the data; the full data is in dating.txt
milage,Liters,Consumtime,target
40920,8.326976,0.953952,3
14488,7.153469,1.673904,2
26052,1.441871,0.805124,1
75136,13.147394,0.428964,1
38344,1.669788,0.134296,1
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def minmax_demo():
    """
    Normalization demonstration
    :return: None
    """
    data = pd.read_csv("dating.txt")
    print(data)
    # 1. Instantiate a converter class (here mapping each feature to the interval [2, 3])
    transfer = MinMaxScaler(feature_range=(2, 3))
    # 2. Call fit_transform
    data = transfer.fit_transform(data[['milage','Liters','Consumtime']])
    print("Results of normalization of minimum and maximum values:\n", data)

    return None

Return result:

milage     Liters  Consumtime  target
0     40920   8.326976    0.953952       3
1     14488   7.153469    1.673904       2
2     26052   1.441871    0.805124       1
3     75136  13.147394    0.428964       1
..      ...        ...         ...     ...
998   48111   9.134528    0.728045       3
999   43757   7.882601    1.332446       3

[1000 rows x 4 columns]
Results of normalization of minimum and maximum values:
 [[ 2.44832535  2.39805139  2.56233353]
 [ 2.15873259  2.34195467  2.98724416]
 [ 2.28542943  2.06892523  2.47449629]
 ..., 
 [ 2.29115949  2.50910294  2.51079493]
 [ 2.52711097  2.43665451  2.4290048 ]
 [ 2.47940793  2.3768091   2.78571804]]

Note: the maximum and minimum values depend on the data and are easily affected by outliers, so this method is not robust and is only suitable for traditional, precise, small-data scenarios.

2.5.3 Standardization

Transforms the original data so that each column has mean 0 and standard deviation 1.

Formula: X' = (x - mean) / σ, where mean is the column mean and σ is the column standard deviation.

For normalization: if outliers affect the maximum or minimum, the result obviously changes.
For standardization: with a sufficient amount of data, a small number of outliers have little impact on the mean, so the variance changes little. A small comparison sketch follows below.
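A small comparison sketch (the toy column of values and the single outlier are made up for this illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.RandomState(0)
# 1000 ordinary values around 50 plus a single large outlier (toy data)
x = np.append(rng.normal(loc=50, scale=5, size=1000), 500).reshape(-1, 1)

minmax = MinMaxScaler().fit_transform(x)
standard = StandardScaler().fit_transform(x)

# Min-max: the outlier becomes the maximum, so the 1000 ordinary values
# are squeezed into a narrow band near 0
print(minmax[:1000].min(), minmax[:1000].max())

# Standardization: the single outlier shifts the mean only slightly,
# so the ordinary values keep a much wider relative spread
print(standard[:1000].min(), standard[:1000].max())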

Standardization API

sklearn.preprocessing.StandardScaler()
	·After processing, the data of each column is centered around mean 0 with standard deviation 1
	·StandardScaler.fit_transform(X)
	·X: data in numpy array format [n_samples, n_features]
	·Return value: an array of the same shape after conversion

Standardize the above data

import pandas as pd
from sklearn.preprocessing import StandardScaler

def stand_demo():
    """
    Standardized demonstration
    :return: None
    """
    data = pd.read_csv("dating.txt")
    print(data)
    # 1. Instantiate a converter class
    transfer = StandardScaler()
    # 2. Call fit_transform
    data = transfer.fit_transform(data[['milage','Liters','Consumtime']])
    print("Standardized results:\n", data)
    print("Average value of characteristics of each column:\n", transfer.mean_)
    print("Variance of characteristics of each column:\n", transfer.var_)

    return None

Return result:

      milage     Liters  Consumtime  target
0     40920   8.326976    0.953952       3
1     14488   7.153469    1.673904       2
2     26052   1.441871    0.805124       1
..      ...        ...         ...     ...
997   26575  10.650102    0.866627       3
998   48111   9.134528    0.728045       3
999   43757   7.882601    1.332446       3

[1000 rows x 4 columns]
Standardized results:
 [[ 0.33193158  0.41660188  0.24523407]
 [-0.87247784  0.13992897  1.69385734]
 [-0.34554872 -1.20667094 -0.05422437]
 ..., 
 [-0.32171752  0.96431572  0.06952649]
 [ 0.65959911  0.60699509 -0.20931587]
 [ 0.46120328  0.31183342  1.00680598]]
Average value of characteristics of each column:
 [  3.36354210e+04   6.55996083e+00   8.32072997e-01]
Variance of characteristics of each column:
 [  4.81628039e+08   1.79902874e+01   2.46999554e-01]

**Note:** standardization is relatively stable when there are enough samples, and is suitable for modern, noisy, big-data scenarios.

2.6 Feature dimensionality reduction

2.6.1 What is dimensionality reduction
Dimensionality reduction is the process of reducing the number of random variables (features) under certain constraints to obtain a set of "uncorrelated" principal variables.
It mainly reduces the number of random variables and of correlated features.

2.6.2 Why reduce dimensionality
Because training uses the features for learning: if the features are problematic or strongly correlated with each other, this has a large impact on how the algorithm learns and predicts.
There are two approaches:
feature selection and principal component analysis.
2.6.3 Feature selection
What is feature selection:

the data contains redundant or irrelevant variables (also called features, attributes, indicators, etc.); feature selection aims to find the main features among the original features.

Methods:

Filter: mainly explores the characteristics of the features themselves, the relationships between features, and the relationship between features and the target value
 Variance selection method: low-variance feature filtering
 Correlation coefficient
Embedded: the algorithm automatically selects features (association between features and target values)
 Decision tree: information entropy, information gain
 Regularization: L1, L2
 Deep learning: convolution, etc.

Filter type (low-variance filtering):

Delete features with low variance, considering the method together with the magnitude of the variance.
Small feature variance: most samples have very similar values for this feature
Large feature variance: many samples have different values for this feature

API

sklearn.feature_selection.VarianceThreshold(threshold = 0.0)
	Removes all low-variance features
	VarianceThreshold.fit_transform(X)
	X: data in numpy array format [n_samples, n_features]
	Return value: features whose training-set variance is lower than threshold are removed. The default keeps all features with non-zero variance, i.e. removes features that have the same value in all samples.

Case: (stock data)

index,pe_ratio,pb_ratio,market_cap,return_on_asset_net_profit,du_return_on_equity,ev,earnings_per_share,revenue,total_expense,date,return
0,000001.XSHE,5.9572,1.1818,85252550922.0,0.8008,14.9403,1211444855670.0,2.01,20701401000.0,10882540000.0,2012-01-31,0.027657228229937388
1,000002.XSHE,7.0289,1.588,84113358168.0,1.6463,7.8656,300252061695.0,0.326,29308369223.2,23783476901.2,2012-01-31,0.08235182370820669
2,000008.XSHE,-262.7461,7.0003,517045520.0,-0.5678,-0.5943,770517752.56,-0.006,11679829.03,12030080.04,2012-01-31,0.09978900335112327
3,000060.XSHE,16.476,3.7146,19680455995.0,5.6036,14.617,28009159184.6,0.35,9189386877.65,7935542726.05,2012-01-31,0.12159482758620697
4,000069.XSHE,12.5878,2.5616,41727214853.0,2.8729,10.9097,81247380359.0,0.271,8951453490.28,7091397989.13,2012-01-31,-0.0026808154146886697

analysis:
1. Initialize VarianceThreshold
2. Call fit_transform

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def variance_demo():
    """
    Delete low variance feature - feature selection
    :return: None
    """
    data = pd.read_csv("factor_returns.csv")
    print(data)
    # 1. Instantiate a converter class
    transfer = VarianceThreshold(threshold=1)
    # 2. Call fit_transform
    data = transfer.fit_transform(data.iloc[:, 1:10])
    print("Results of deleting low variance features:\n", data)
    print("Shape:\n", data.shape)

    return None

2.6.4 Correlation coefficient

Pearson correlation coefficient (Pearson Correlation Coefficient)
	·A statistical indicator reflecting how closely two variables are related

Formula (it is enough to know it):

r = Σ[(x - mean(x))(y - mean(y))] / sqrt( Σ(x - mean(x))² · Σ(y - mean(y))² )

Characteristics:

The value of the correlation coefficient lies between -1 and +1, i.e. -1 ≤ r ≤ +1. Its properties are as follows (a quick numeric check follows this list):
When r > 0, the two variables are positively correlated; when r < 0, they are negatively correlated.
When |r| = 1, the two variables are perfectly linearly correlated; when r = 0, there is no linear correlation between them.
When 0 < |r| < 1, there is some degree of correlation. The closer |r| is to 1, the closer the linear relationship between the two variables; the closer |r| is to 0, the weaker their linear correlation.
In general, three levels are distinguished: |r| < 0.4 is low correlation; 0.4 ≤ |r| < 0.7 is significant correlation; 0.7 ≤ |r| < 1 is highly linear correlation.
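A quick numeric check of these properties, using the same scipy.stats.pearsonr function as in the case below (the toy arrays are made up):

import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Perfect positive linear relationship: r = +1
print(pearsonr(x, 2 * x + 1)[0])

# Perfect negative linear relationship: r = -1
print(pearsonr(x, -3 * x)[0])

# No linear relationship: r = 0
print(pearsonr(x, np.array([1.0, -1.0, 1.0, -1.0, 1.0]))[0])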

Case: correlation calculation of financial indicators of stocks

import pandas as pd
from scipy.stats import pearsonr

def pearsonr_demo():
    """
    Correlation coefficient calculation
    :return: None
    """
    data = pd.read_csv("factor_returns.csv")

    factor = ['pe_ratio', 'pb_ratio', 'market_cap', 'return_on_asset_net_profit', 'du_return_on_equity', 'ev',
              'earnings_per_share', 'revenue', 'total_expense']

    for i in range(len(factor)):
        for j in range(i, len(factor) - 1):
            print(
                "The correlation between indicator %s and indicator %s is %f" % (factor[i], factor[j + 1], pearsonr(data[factor[i]], data[factor[j + 1]])[0]))

    return None

Return partial results:

We can also plot the data with matplotlib to observe the correlation; a sketch follows below.
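A minimal plotting sketch (assuming matplotlib is installed and factor_returns.csv is available as in the case above; the pair of indicators is chosen only for illustration):

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("factor_returns.csv")

# Scatter plot of two indicators; a roughly linear cloud suggests strong correlation
plt.figure(figsize=(8, 6))
plt.scatter(data["revenue"], data["total_expense"])
plt.xlabel("revenue")
plt.ylabel("total_expense")
plt.show()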

2.7 Principal component analysis (PCA)

2.7.1 What is principal component analysis

Definition: the process of transforming high-dimensional data into low-dimensional data. During this process, original variables may be discarded and new variables created.

Function: data dimension compression, reducing the dimensionality (complexity) of the original data as much as possible while losing only a small amount of information.

Application: regression analysis or cluster analysis.

API

sklearn.decomposition.PCA(n_components=None)
	Projects the data into a lower-dimensional space
	n_components:
		decimal: the fraction of information (variance) to retain
		integer: the number of features to reduce to
	PCA.fit_transform(X)  X: data in numpy array format [n_samples, n_features]
		Return value: an array with the specified number of dimensions after conversion
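A small sketch of the two ways n_components can be given (the toy matrix is made up):

from sklearn.decomposition import PCA

# Toy data: 3 samples with 4 features
data = [[2, 8, 4, 5],
        [6, 3, 0, 8],
        [5, 4, 9, 1]]

# Decimal: keep 90% of the information (variance)
transfer1 = PCA(n_components=0.9)
print(transfer1.fit_transform(data))

# Integer: reduce to exactly 2 features
transfer2 = PCA(n_components=2)
print(transfer2.fit_transform(data))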

Case: explore users' preferences for item categories, then segment users and reduce dimensionality

The data are as follows:
order_products__prior.csv: Order and product information
 Field: order_id, product_id, add_to_cart_order, reordered

products.csv: Commodity information
 Field: product_id, product_name, aisle_id, department_id

orders.csv: User's order information
 Field: order_id,user_id,eval_set,order_number,....

aisles.csv: Specific item category of commodity
 Field: aisle_id, aisle

analysis:

1. Merge the tables so that user_id and aisle end up in one table
2. Build a cross table (crosstab)
3. Reduce dimensionality with PCA
import pandas as pd
from sklearn.decomposition import PCA

# 1. Get the datasets
# ·Product information - products.csv:
#   Fields: product_id, product_name, aisle_id, department_id
# ·Order and product information - order_products__prior.csv:
#   Fields: order_id, product_id, add_to_cart_order, reordered
# ·User's order information - orders.csv:
#   Fields: order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day, days_since_prior_order
# ·Specific item category of goods - aisles.csv:
#   Fields: aisle_id, aisle
products = pd.read_csv("./instacart/products.csv")
order_products = pd.read_csv("./instacart/order_products__prior.csv")
orders = pd.read_csv("./instacart/orders.csv")
aisles = pd.read_csv("./instacart/aisles.csv")

# 2. Merge the tables so that user_id and aisle end up in one table
# 1) Merge orders and order_products on order_id -> tab1: order_id, product_id, user_id, ...
tab1 = pd.merge(orders, order_products, on="order_id")
# 2) Merge tab1 and products on product_id -> tab2: ..., aisle_id
tab2 = pd.merge(tab1, products, on="product_id")
# 3) Merge tab2 and aisles on aisle_id -> tab3: ..., user_id, aisle
tab3 = pd.merge(tab2, aisles, on="aisle_id")

# 3. Build a cross table grouping user_id by aisle
table = pd.crosstab(tab3["user_id"], tab3["aisle"])

# 4. Principal component analysis is used to reduce dimension
# 1) Instantiate a converter class PCA
transfer = PCA(n_components=0.95) #Retain 95% of information
# 2)fit_transform
data = transfer.fit_transform(table)

print(data.shape)

Return result:

(206209, 44)

III. summary chart

Topics: Machine Learning