Data Mining Using sklearn

Posted by Jibberish on Wed, 12 Jun 2019 23:34:11 +0200

  • Steps of Data Mining

    Data mining usually includes data acquisition, data analysis, feature engineering, model training, model evaluation and other steps.

  • sklearn workflow


sklearn mainly covers the work inside the dotted boxes of the workflow diagram (it can also extract text features).

  • sklearn's main methods: fit, fit_transform and transform

    The transform method is mainly used to transform features.

    • From the perspective of the information used, transformations can be divided into non-informational transformations and informational transformations.

      Non-informational transformation: a transformation that uses no information beyond the values being transformed,

        such as:
        exponential or logarithmic function transformations, etc.
      • Informational transformations can be further divided into unsupervised and supervised transformations, depending on whether the target value vector is used.
        • Unsupervised transformation: a transformation that uses only statistics of the features, such as the mean, standard deviation or boundaries, e.g. standardization, PCA dimensionality reduction, etc.
        • Supervised transformation: a transformation that uses both feature information and target value information, e.g. feature selection via a model, LDA dimensionality reduction, etc. (A short sketch contrasting unsupervised and supervised transformations follows this list.)
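
    A minimal sketch contrasting the two kinds of informational transformation, using the iris dataset purely for illustration: the unsupervised StandardScaler fits on the features alone, while the supervised SelectKBest also needs the target vector.

    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest, chi2

    iris = load_iris()

    #Unsupervised transformation: fit only needs the features (it learns their mean and standard deviation)
    scaler = StandardScaler().fit(iris.data)
    X_scaled = scaler.transform(iris.data)

    #Supervised transformation: fit also needs the target vector in order to score the features
    selector = SelectKBest(chi2, k=2).fit(iris.data, iris.target)
    X_selected = selector.transform(iris.data)

    #fit_transform is equivalent to calling fit and then transform on the same data
    X_scaled_again = StandardScaler().fit_transform(iris.data)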
  • A summary of commonly used transformation classes can be found in the following table:

    import sklearn.preprocessing as prep

    import sklearn.feature_selection as fs

    import sklearn.decomposition as dp

Package     | Class               | Parameters                    | Category                  | fit useful? | Description
----------- | ------------------- | ----------------------------- | ------------------------- | ----------- | -----------
prep        | StandardScaler      | features                      | unsupervised              | Y           | standardization
prep        | MinMaxScaler        | features                      | unsupervised              | Y           | scaling to a range
prep        | Normalizer          | features                      | non-informational         | N           | normalization
prep        | Binarizer           | features                      | non-informational         | N           | binarization of quantitative features
prep        | OneHotEncoder       | features                      | unsupervised              | Y           | encoding of qualitative features
prep        | Imputer             | features                      | unsupervised              | Y           | missing-value imputation
prep        | PolynomialFeatures  | features                      | non-informational         | N           | polynomial transformation (fit only generates the polynomial terms)
prep        | FunctionTransformer | features                      | non-informational         | N           | custom function transformation (the custom function is called in transform)
fs          | VarianceThreshold   | features                      | unsupervised              | Y           | variance-based feature selection
fs          | SelectKBest         | features / features + target  | unsupervised / supervised | Y           | feature selection with a custom score function
fs          | SelectKBest + chi2  | features + target             | supervised                | Y           | chi-square test feature selection
fs          | RFE                 | features + target             | supervised                | Y           | recursive feature elimination
fs          | SelectFromModel     | features + target             | supervised                | Y           | model-based feature selection
dp          | PCA                 | features                      | unsupervised              | Y           | PCA dimensionality reduction
sklearn.lda | LDA                 | features + target             | supervised                | Y           | LDA dimensionality reduction
  • The main job of the fit method is to obtain information from the features and, where applicable, from the target values.

    For example, the fit method of Normalizer is implemented as follows (it does nothing beyond validating the input):

    def fit(self, X, y=None):
      """Do nothing and return the estimator unchanged
      This method is just there to implement the usual API and hence
      work in pipelines.
      """
      X = check_array(X, accept_sparse='csr')
      return self
  • Pipelining: the output of one piece of work is the input of the next.
  • Parallel processing: several pieces of work run on the same input at the same time; once they all finish, their respective outputs are combined into a single output.

    sklearn's pipeline module provides classes for both pipelined and parallel processing.

  • Key techniques

      * Parallel processing and pipeline processing: combine multiple feature-processing tasks, possibly including model training, into a single piece of work (i.e. combine multiple objects into one object)

      * Automated parameter tuning: reduce the tedium of tuning parameters by hand

      * Persistence: a trained model is data held in memory; it can be saved to the file system and later loaded from there without re-training
  • Parallel processing

    Parallel processing lets several feature-processing tasks run at the same time. Depending on how the feature matrix is read, it can be divided into global parallel processing and partial parallel processing.

  • Global parallel processing: every parallel task takes the whole feature matrix as its input.

    The pipeline module provides the FeatureUnion class to implement global parallel processing:

    from numpy import log1p
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.preprocessing import Binarizer
    from sklearn.pipeline import FeatureUnion
    #New object that applies the logarithmic function to the whole feature matrix
    step2_1 = ('ToLog', FunctionTransformer(log1p))
    #New object that binarizes the whole feature matrix
    step2_2 = ('ToBinary', Binarizer())
    #New global parallel processing object
    #This object also has fit and transform methods; both call the fit and transform methods of the parallel-processed objects in parallel
    #The parameter transformer_list is the list of objects to be processed in parallel; it is a list of 2-tuples, the first element being the object's name and the second the object itself
    step2 = ('FeatureUnion', FeatureUnion(transformer_list=[step2_1, step2_2]))
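
    A minimal usage sketch on the iris dataset (for illustration only): the FeatureUnion defined above applies both transformers to the whole feature matrix and concatenates their outputs column by column.

    from sklearn.datasets import load_iris

    iris = load_iris()
    #Both transformers see all 4 iris columns, so the output has 4 log-transformed + 4 binarized columns
    X_union = step2[1].fit_transform(iris.data)
    print(X_union.shape)  #(150, 8)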
  • Partial parallel processing: each task is given its own columns of the feature matrix.

    It is implemented by extending pipeline.FeatureUnion:

    from sklearn.pipeline import FeatureUnion, _fit_one_transformer,   _fit_transform_one, _transform_one 
    from sklearn.externals.joblib import Parallel, delayed
    from scipy import sparse
    import numpy as np
    class FeatureUnionExt(FeatureUnion):
        def __init__(self, transformer_list, idx_list, n_jobs=1, transformer_weights=None):
            self.idx_list = idx_list
            #Pass the (name, transformer) tuples through as a list (a map object is not reusable under Python 3)
            FeatureUnion.__init__(self, transformer_list=list(transformer_list), n_jobs=n_jobs, transformer_weights=transformer_weights)
        def fit(self, X, y=None):
            transformer_idx_list = map(lambda trans, idx:(trans[0], trans[1], idx), self.transformer_list, self.idx_list)
            transformers = Parallel(n_jobs=self.n_jobs)(
              delayed(_fit_one_transformer)(trans, X[:,idx], y)
                for name, trans, idx in transformer_idx_list)
            self._update_transformer_list(transformers)
            return self
        def fit_transform(self, X, y=None, **fit_params):
            transformer_idx_list = map(lambda trans, idx:(trans[0], trans[1], idx), self.transformer_list, self.idx_list)
            result = Parallel(n_jobs=self.n_jobs)(
              delayed(_fit_transform_one)(trans, name, X[:,idx], y,
                                          self.transformer_weights, **fit_params)
                for name, trans, idx in transformer_idx_list)
            Xs, transformers = zip(*result)
            self._update_transformer_list(transformers)
            if any(sparse.issparse(f) for f in Xs):
                Xs = sparse.hstack(Xs).tocsr()
            else:
                Xs = np.hstack(Xs)
            return Xs
        def transform(self, X):
            transformer_idx_list = map(lambda trans, idx:(trans[0], trans[1], idx), self.transformer_list, self.idx_list)
            Xs = Parallel(n_jobs=self.n_jobs)(
              delayed(_transform_one)(trans, name, X[:,idx], self.transformer_weights)
                for name, trans, idx in transformer_idx_list)
            if any(sparse.issparse(f) for f in Xs):
                Xs = sparse.hstack(Xs).tocsr()
            else:
                Xs = np.hstack(Xs)
            return Xs

    In the scenario used in this article, we one-hot encode the first column of the feature matrix (flower color), apply the logarithmic function to the second, third and fourth columns, and binarize the quantitative fifth column. The code for partial parallel processing with the FeatureUnionExt class is as follows:

    from numpy import log1p
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.preprocessing import Binarizer
    #New object that one-hot encodes the qualitative feature in part of the feature matrix
    step2_1 = ('OneHotEncoder', OneHotEncoder(sparse=False))
    #New object that applies the logarithmic function to part of the feature matrix
    step2_2 = ('ToLog', FunctionTransformer(log1p))
    #New object that binarizes part of the feature matrix
    step2_3 = ('ToBinary', Binarizer())
    #New partial parallel processing object
    #The parameter transformer_list is the list of objects to be processed in parallel; it is a list of 2-tuples, the first element being the object's name and the second the object itself
    #The parameter idx_list gives the columns of the feature matrix that each object reads
    step2 = ('FeatureUnionExt', FeatureUnionExt(transformer_list=[step2_1, step2_2, step2_3], idx_list=[[0], [1, 2, 3], [4]]))
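
    A minimal usage sketch with hypothetical data (and assuming the private sklearn helpers imported above still behave as in the sklearn version this class was written against): a random flower-color code is prepended to the iris features to obtain the 5-column matrix the scenario describes.

    import numpy as np
    from numpy.random import choice
    from sklearn.datasets import load_iris

    iris = load_iris()
    #Hypothetical 5-column matrix: a random qualitative "flower color" code followed by the 4 iris features
    X_ext = np.hstack((choice([0, 1, 2], size=150).reshape(-1, 1), iris.data))
    #Column 0 is one-hot encoded, columns 1-3 are log-transformed, column 4 is binarized
    X_out = step2[1].fit_transform(X_ext)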
  • Pipeline processing

    The pipeline module provides the Pipeline class to implement pipeline processing.
    Every step of a pipeline except the last must provide a fit_transform method, and each step's output is the input of the next step. The last step must provide a fit method (its input is the output of the previous step) but does not necessarily need a transform method, because the last step of a pipeline may be a model to be trained!

    from numpy import log1p
    from sklearn.preprocessing import Imputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.preprocessing import Binarizer
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    #New object that imputes missing values
    step1 = ('Imputer', Imputer())
    #New object that one-hot encodes the qualitative feature in part of the feature matrix
    step2_1 = ('OneHotEncoder', OneHotEncoder(sparse=False))
    #New object that applies the logarithmic function to part of the feature matrix
    step2_2 = ('ToLog', FunctionTransformer(log1p))
    #New object that binarizes part of the feature matrix
    step2_3 = ('ToBinary', Binarizer())
    #New partial parallel processing object; its output is the merge of the outputs of the parallel tasks
    step2 = ('FeatureUnionExt', FeatureUnionExt(transformer_list=[step2_1, step2_2, step2_3], idx_list=[[0], [1, 2, 3], [4]]))
    #New object for feature scaling (making features dimensionless)
    step3 = ('MinMaxScaler', MinMaxScaler())
    #New chi-square feature selection object
    step4 = ('SelectKBest', SelectKBest(chi2, k=3))
    #New PCA dimensionality reduction object
    step5 = ('PCA', PCA(n_components=2))
    #New logistic regression object; as the last step of the pipeline it is the model to be trained
    step6 = ('LogisticRegression', LogisticRegression(penalty='l2'))
    #New pipeline processing object
    #The parameter steps is the list of objects to be processed in sequence; it is a list of 2-tuples, the first element being the object's name and the second the object itself
    pipeline = Pipeline(steps=[step1, step2, step3, step4, step5, step6])
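
    A minimal usage sketch, under the same assumptions as the earlier sketch: because the last step is a logistic regression model, the assembled pipeline can be fitted and then used for prediction (X_ext is the hypothetical 5-column matrix built above).

    #Each intermediate step runs fit_transform in turn; the final model is then trained on the result
    model = pipeline.fit(X_ext, iris.target)
    #At prediction time the intermediate steps apply transform and the final model applies predict
    predictions = model.predict(X_ext)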
  • Automatic parameter adjustment

    The grid_search module provides tools for automated parameter tuning, including the GridSearchCV class, which trains and tunes the combined object:

    from sklearn.grid_search import GridSearchCV
    #New grid search object
    #The first parameter is the model to be trained.
    #param_grid is the grid of parameters to tune, given as a dictionary whose keys are parameter names (in the form "objectname__subobjectname__parametername", joined with double underscores) and whose values are lists of candidate parameter values
    grid_search = GridSearchCV(pipeline, param_grid={'FeatureUnionExt__ToBinary__threshold':[1.0, 2.0, 3.0, 4.0], 'LogisticRegression__C':[0.1, 0.2, 0.4, 0.8]})
    #Training and parameter tuning (iris.data is assumed here to be the extended 5-column feature matrix described above)
    grid_search.fit(iris.data, iris.target)
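
    Once fit has finished, the best parameter combination found on the grid and its cross-validated score can be read directly from the fitted object:

    #Best parameter combination and its cross-validated score
    print(grid_search.best_params_)
    print(grid_search.best_score_)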
  • Persistence

    The externals.joblib module provides the dump and load methods to persist and reload in-memory data:

    from sklearn.externals.joblib import dump, load
    #Persist the in-memory object
    #The first parameter is the object in memory
    #The second parameter is the file name to save it under
    #The third parameter is the compression level: 0 means no compression, 3 is a reasonable level
    dump(grid_search, 'grid_search.dmp', compress=3)
    #Loading data from the file system into memory
    grid_search = load('grid_search.dmp')
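
    The reloaded object is the fitted GridSearchCV, so it can be used for prediction immediately, without re-training:

    #Predict with the reloaded model (iris.data is the same extended feature matrix used for training)
    print(grid_search.predict(iris.data)[:5])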
