Data Mining Using sklearn

Posted by Jibberish on Wed, 12 Jun 2019 23:34:11 +0200

  • Steps of Data Mining

    Data mining usually includes data acquisition, data analysis, feature engineering, model training, model evaluation and other steps.

  • sklearn workflow


sklearn mainly covers the work inside the dotted boxes of the workflow diagram (it can also extract text features).

  • sklearn's main methods: fit, fit_transform and transform

    The transform method is mainly used to transform features.

    • From the perspective of the information used, transformations can be divided into non-informational transformations and informational transformations.

      Non-informational transformation: a transformation that uses no information beyond the values being transformed,

        such as:
        exponential or logarithmic function transformations, etc.
      • Informational transformations can be further divided into unsupervised and supervised transformations, depending on whether the target value vector is used.
        • Unsupervised transformation: a transformation that uses only statistics of the features, such as the mean, standard deviation or boundaries, e.g. standardization, PCA dimensionality reduction, etc.
        • Supervised transformation: a transformation that uses both feature information and target value information, e.g. feature selection via a model, LDA dimensionality reduction, etc. (A short sketch contrasting unsupervised and supervised transformations follows this list.)
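
    A minimal sketch contrasting the two kinds of informational transformation, using the iris dataset purely for illustration: the unsupervised StandardScaler fits on the features alone, while the supervised SelectKBest also needs the target vector.

    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest, chi2

    iris = load_iris()

    #Unsupervised transformation: fit only needs the features (it learns their mean and standard deviation)
    scaler = StandardScaler().fit(iris.data)
    X_scaled = scaler.transform(iris.data)

    #Supervised transformation: fit also needs the target vector in order to score the features
    selector = SelectKBest(chi2, k=2).fit(iris.data, iris.target)
    X_selected = selector.transform(iris.data)

    #fit_transform is equivalent to calling fit and then transform on the same data
    X_scaled_again = StandardScaler().fit_transform(iris.data)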
  • A summary of commonly used transformation classes can be found in the following table:

    import sklearn.preprocessing as prep

    import sklearn.feature_selection as fs

    import sklearn.decomposition as dp

Package     | Class               | Parameters                    | Category                  | fit useful? | Description
----------- | ------------------- | ----------------------------- | ------------------------- | ----------- | -----------
prep        | StandardScaler      | features                      | unsupervised              | Y           | standardization
prep        | MinMaxScaler        | features                      | unsupervised              | Y           | scaling to a range
prep        | Normalizer          | features                      | non-informational         | N           | normalization
prep        | Binarizer           | features                      | non-informational         | N           | binarization of quantitative features
prep        | OneHotEncoder       | features                      | unsupervised              | Y           | encoding of qualitative features
prep        | Imputer             | features                      | unsupervised              | Y           | missing-value imputation
prep        | PolynomialFeatures  | features                      | non-informational         | N           | polynomial transformation (fit only generates the polynomial terms)
prep        | FunctionTransformer | features                      | non-informational         | N           | custom function transformation (the custom function is called in transform)
fs          | VarianceThreshold   | features                      | unsupervised              | Y           | variance-based feature selection
fs          | SelectKBest         | features / features + target  | unsupervised / supervised | Y           | feature selection with a custom score function
fs          | SelectKBest + chi2  | features + target             | supervised                | Y           | chi-square test feature selection
fs          | RFE                 | features + target             | supervised                | Y           | recursive feature elimination
fs          | SelectFromModel     | features + target             | supervised                | Y           | model-based feature selection
dp          | PCA                 | features                      | unsupervised              | Y           | PCA dimensionality reduction
sklearn.lda | LDA                 | features + target             | supervised                | Y           | LDA dimensionality reduction
  • The main job of the fit method is to obtain information from the features and, where applicable, from the target values.

    For example, the fit method of Normalizer is implemented as follows (it does nothing beyond validating the input):

    def fit(self, X, y=None):
      """Do nothing and return the estimator unchanged
      This method is just there to implement the usual API and hence
      work in pipelines.
      """
      X = check_array(X, accept_sparse='csr')
      return self
  • Pipelining: the output of one piece of work is the input of the next.
  • Parallel processing: several pieces of work run on the same input at the same time; once they all finish, their respective outputs are combined into a single output.

    sklearn's pipeline module provides classes for both pipelined and parallel processing.

  • Key techniques

      * Parallel processing and pipeline processing: combine multiple feature-processing tasks, possibly including model training, into a single piece of work (i.e. combine multiple objects into one object)

      * Automated parameter tuning: reduce the tedium of tuning parameters by hand

      * Persistence: a trained model is data held in memory; it can be saved to the file system and later loaded from there without re-training
  • Parallel processing

    Parallel processing lets several feature-processing tasks run at the same time. Depending on how the feature matrix is read, it can be divided into global parallel processing and partial parallel processing.

  • Global parallel processing: every parallel task takes the whole feature matrix as its input.

    The pipeline module provides the FeatureUnion class to implement global parallel processing:

    from numpy import log1p
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.preprocessing import Binarizer
    from sklearn.pipeline import FeatureUnion
    #New object that applies the logarithmic function to the whole feature matrix
    step2_1 = ('ToLog', FunctionTransformer(log1p))
    #New object that binarizes the whole feature matrix
    step2_2 = ('ToBinary', Binarizer())
    #New global parallel processing object
    #This object also has fit and transform methods; both call the fit and transform methods of the parallel-processed objects in parallel
    #The parameter transformer_list is the list of objects to be processed in parallel; it is a list of 2-tuples, the first element being the object's name and the second the object itself
    step2 = ('FeatureUnion', FeatureUnion(transformer_list=[step2_1, step2_2]))
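
    A minimal usage sketch on the iris dataset (for illustration only): the FeatureUnion defined above applies both transformers to the whole feature matrix and concatenates their outputs column by column.

    from sklearn.datasets import load_iris

    iris = load_iris()
    #Both transformers see all 4 iris columns, so the output has 4 log-transformed + 4 binarized columns
    X_union = step2[1].fit_transform(iris.data)
    print(X_union.shape)  #(150, 8)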
  • Partial parallel processing: each task is given its own columns of the feature matrix.

    It is implemented by extending pipeline.FeatureUnion:

    from sklearn.pipeline import FeatureUnion, _fit_one_transformer,   _fit_transform_one, _transform_one 
    from sklearn.externals.joblib import Parallel, delayed
    from scipy import sparse
    import numpy as np
    class FeatureUnionExt(FeatureUnion):
        def __init__(self, transformer_list, idx_list, n_jobs=1, transformer_weights=None):
            self.idx_list = idx_list
            #Pass the (name, transformer) tuples through as a list (a map object is not reusable under Python 3)
            FeatureUnion.__init__(self, transformer_list=list(transformer_list), n_jobs=n_jobs, transformer_weights=transformer_weights)
        def fit(self, X, y=None):
            transformer_idx_list = map(lambda trans, idx:(trans[0], trans[1], idx), self.transformer_list, self.idx_list)
            transformers = Parallel(n_jobs=self.n_jobs)(
              delayed(_fit_one_transformer)(trans, X[:,idx], y)
                for name, trans, idx in transformer_idx_list)
            self._update_transformer_list(transformers)
            return self
        def fit_transform(self, X, y=None, **fit_params):
            transformer_idx_list = map(lambda trans, idx:(trans[0], trans[1], idx), self.transformer_list, self.idx_list)
            result = Parallel(n_jobs=self.n_jobs)(
              delayed(_fit_transform_one)(trans, name, X[:,idx], y,
                                          self.transformer_weights, **fit_params)
                for name, trans, idx in transformer_idx_list)
            Xs, transformers = zip(*result)
            self._update_transformer_list(transformers)
            if any(sparse.issparse(f) for f in Xs):
                Xs = sparse.hstack(Xs).tocsr()
            else:
                Xs = np.hstack(Xs)
            return Xs
        def transform(self, X):
            transformer_idx_list = map(lambda trans, idx:(trans[0], trans[1], idx), self.transformer_list, self.idx_list)
            Xs = Parallel(n_jobs=self.n_jobs)(
              delayed(_transform_one)(trans, name, X[:,idx], self.transformer_weights)
                for name, trans, idx in transformer_idx_list)
            if any(sparse.issparse(f) for f in Xs):
                Xs = sparse.hstack(Xs).tocsr()
            else:
                Xs = np.hstack(Xs)
            return Xs

    In the scenario used in this article, we one-hot encode the first column of the feature matrix (flower color), apply the logarithmic function to the second, third and fourth columns, and binarize the quantitative fifth column. The code for partial parallel processing with the FeatureUnionExt class is as follows:

    from numpy import log1p
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.preprocessing import Binarizer
    #New object that one-hot encodes the qualitative feature in part of the feature matrix
    step2_1 = ('OneHotEncoder', OneHotEncoder(sparse=False))
    #New object that applies the logarithmic function to part of the feature matrix
    step2_2 = ('ToLog', FunctionTransformer(log1p))
    #New object that binarizes part of the feature matrix
    step2_3 = ('ToBinary', Binarizer())
    #New partial parallel processing object
    #The parameter transformer_list is the list of objects to be processed in parallel; it is a list of 2-tuples, the first element being the object's name and the second the object itself
    #The parameter idx_list gives the columns of the feature matrix that each object reads
    step2 = ('FeatureUnionExt', FeatureUnionExt(transformer_list=[step2_1, step2_2, step2_3], idx_list=[[0], [1, 2, 3], [4]]))
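
    A minimal usage sketch with hypothetical data (and assuming the private sklearn helpers imported above still behave as in the sklearn version this class was written against): a random flower-color code is prepended to the iris features to obtain the 5-column matrix the scenario describes.

    import numpy as np
    from numpy.random import choice
    from sklearn.datasets import load_iris

    iris = load_iris()
    #Hypothetical 5-column matrix: a random qualitative "flower color" code followed by the 4 iris features
    X_ext = np.hstack((choice([0, 1, 2], size=150).reshape(-1, 1), iris.data))
    #Column 0 is one-hot encoded, columns 1-3 are log-transformed, column 4 is binarized
    X_out = step2[1].fit_transform(X_ext)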
  • Pipeline processing

    The pipeline module provides the Pipeline class to implement pipeline processing.
    Every step of a pipeline except the last must provide a fit_transform method, and each step's output is the input of the next step. The last step must provide a fit method (its input is the output of the previous step) but does not necessarily need a transform method, because the last step of a pipeline may be a model to be trained!

    from numpy import log1p
    from sklearn.preprocessing import Imputer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.preprocessing import Binarizer
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    #New object that imputes missing values
    step1 = ('Imputer', Imputer())
    #New object that one-hot encodes the qualitative feature in part of the feature matrix
    step2_1 = ('OneHotEncoder', OneHotEncoder(sparse=False))
    #New object that applies the logarithmic function to part of the feature matrix
    step2_2 = ('ToLog', FunctionTransformer(log1p))
    #New object that binarizes part of the feature matrix
    step2_3 = ('ToBinary', Binarizer())
    #New partial parallel processing object; its output is the merge of the outputs of the parallel tasks
    step2 = ('FeatureUnionExt', FeatureUnionExt(transformer_list=[step2_1, step2_2, step2_3], idx_list=[[0], [1, 2, 3], [4]]))
    #New object for feature scaling (making features dimensionless)
    step3 = ('MinMaxScaler', MinMaxScaler())
    #New chi-square feature selection object
    step4 = ('SelectKBest', SelectKBest(chi2, k=3))
    #New PCA dimensionality reduction object
    step5 = ('PCA', PCA(n_components=2))
    #New logistic regression object; as the last step of the pipeline it is the model to be trained
    step6 = ('LogisticRegression', LogisticRegression(penalty='l2'))
    #New pipeline processing object
    #The parameter steps is the list of objects to be processed in sequence; it is a list of 2-tuples, the first element being the object's name and the second the object itself
    pipeline = Pipeline(steps=[step1, step2, step3, step4, step5, step6])
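
    A minimal usage sketch, under the same assumptions as the earlier sketch: because the last step is a logistic regression model, the assembled pipeline can be fitted and then used for prediction (X_ext is the hypothetical 5-column matrix built above).

    #Each intermediate step runs fit_transform in turn; the final model is then trained on the result
    model = pipeline.fit(X_ext, iris.target)
    #At prediction time the intermediate steps apply transform and the final model applies predict
    predictions = model.predict(X_ext)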
  • Automatic parameter adjustment

    The grid_search module provides tools for automated parameter tuning, including the GridSearchCV class, which trains and tunes the combined object:

    from sklearn.grid_search import GridSearchCV
    #New grid search object
    #The first parameter is the model to be trained.
    #param_grid is the grid of parameters to tune, given as a dictionary whose keys are parameter names (in the form "objectname__subobjectname__parametername", joined with double underscores) and whose values are lists of candidate parameter values
    grid_search = GridSearchCV(pipeline, param_grid={'FeatureUnionExt__ToBinary__threshold':[1.0, 2.0, 3.0, 4.0], 'LogisticRegression__C':[0.1, 0.2, 0.4, 0.8]})
    #Training and parameter tuning (iris.data is assumed here to be the extended 5-column feature matrix described above)
    grid_search.fit(iris.data, iris.target)
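
    Once fit has finished, the best parameter combination found on the grid and its cross-validated score can be read directly from the fitted object:

    #Best parameter combination and its cross-validated score
    print(grid_search.best_params_)
    print(grid_search.best_score_)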
  • Persistence

    The externals.joblib module provides the dump and load methods to persist and reload in-memory data:

    from sklearn.externals.joblib import dump, load
    #Persist the in-memory object
    #The first parameter is the object in memory
    #The second parameter is the file name to save it under
    #The third parameter is the compression level: 0 means no compression, 3 is a reasonable level
    dump(grid_search, 'grid_search.dmp', compress=3)
    #Loading data from the file system into memory
    grid_search = load('grid_search.dmp')
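
    The reloaded object is the fitted GridSearchCV, so it can be used for prediction immediately, without re-training:

    #Predict with the reloaded model (iris.data is the same extended feature matrix used for training)
    print(grid_search.predict(iris.data)[:5])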
