-
Steps of Data Mining
Data mining usually includes data acquisition, data analysis, feature engineering, model training, model evaluation, and other steps.
-
sklearn workflow
sklearn's work mainly covers the steps inside the dotted box of the workflow diagram (sklearn can also extract text features).
-
sklearn's main methods: fit, fit_transform, and transform
The transform method is mainly used to transform features.
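As a minimal sketch (using StandardScaler purely as an illustrative transformer, not something mandated by the text above), the three methods are used like this:

```python
from sklearn.preprocessing import StandardScaler

X = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]

scaler = StandardScaler()
scaler.fit(X)                    # learn the statistics (mean, std) needed for the transformation
X_scaled = scaler.transform(X)   # apply the learned transformation to the features

# fit_transform is equivalent to calling fit and then transform
X_scaled_2 = StandardScaler().fit_transform(X)
```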
-
From the perspective of the information used, transformations can be divided into information-free transformations and information-based transformations.
Information-free transformation: a transformation that uses no information other than the values being transformed,
such as exponential or logarithmic function transformations.
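A minimal sketch of an information-free transformation, assuming log1p as the example function: the result depends only on each value itself, not on any statistics of the data.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[1.0, 10.0],
              [100.0, 1000.0]])

# A logarithmic transformation uses no information beyond the values being transformed
X_log = FunctionTransformer(np.log1p).fit_transform(X)
```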
-
Information-based transformations can be further divided into unsupervised and supervised transformations, depending on whether the target vector is used.
- Unsupervised transformation: uses only statistics of the features themselves, such as the mean, standard deviation, and bounds, e.g., standardization, PCA dimensionality reduction.
- Supervised transformation: uses both feature information and target-value information, e.g., model-based feature selection, LDA dimensionality reduction (see the sketch after this list).
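A minimal sketch of the distinction, using PCA as an unsupervised transformation and SelectKBest with chi2 as a supervised one, on the iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()

# Unsupervised transformation: fit looks only at the feature matrix
pca = PCA(n_components=2).fit(iris.data)

# Supervised transformation: fit uses the feature matrix and the target vector
selector = SelectKBest(chi2, k=2).fit(iris.data, iris.target)
```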
-
-
A summary of commonly used transformation classes can be found in the following table:
```python
import sklearn.preprocessing as prep
import sklearn.feature_selection as fs
import sklearn.decomposition as dp
```
Package | Class | Fit/transform input | Category | fit method useful | Description |
---|---|---|---|---|---|
prep | StandardScaler | Features | Unsupervised | Y | Standardization |
prep | MinMaxScaler | Features | Unsupervised | Y | Scaling |
prep | Normalizer | Features | Information-free | N | Normalization |
prep | Binarizer | Features | Information-free | N | Binarization of quantitative features |
prep | OneHotEncoder | Features | Unsupervised | Y | Encoding of qualitative features |
prep | Imputer | Features | Unsupervised | Y | Missing value imputation |
prep | PolynomialFeatures | Features | Information-free | N | Polynomial transformation (the fit method only generates the polynomial expressions) |
prep | FunctionTransformer | Features | Information-free | N | Custom function transformation (the custom function is called in the transform method) |
fs | VarianceThreshold | Features | Unsupervised | Y | Variance-threshold feature selection |
fs | SelectKBest | Features / features + target | Unsupervised / supervised | Y | Feature selection with a custom scoring function |
fs | SelectKBest+chi2 | Features + target | Supervised | Y | Chi-square test feature selection |
fs | RFE | Features + target | Supervised | Y | Recursive feature elimination |
fs | SelectFromModel | Features + target | Supervised | Y | Feature selection based on a trained model |
dp | PCA | Features | Unsupervised | Y | PCA dimensionality reduction |
sklearn.lda | LDA | Features + target | Supervised | Y | LDA dimensionality reduction |
- The main work of the fit method is to obtain feature information and target-value information.
Normalizer's fit method, for example, is implemented as follows:
```python
def fit(self, X, y=None):
    """Do nothing and return the estimator unchanged

    This method is just there to implement the usual API and hence
    work in pipelines.
    """
    X = check_array(X, accept_sparse='csr')
    return self
```
- Pipelining: the output of one step is the input of the next.
-
Parallel processing: multiple steps run at the same time on the same input; after they all finish, their outputs are merged into a single output.
sklearn provides the pipeline module to accomplish both pipelining and parallel processing.
-
Key techniques
* Parallel processing and pipelining: combine multiple feature-processing steps, and even model training, into a single piece of work (i.e., combine multiple objects into one object).
* Automated parameter tuning: reduce the tedium of tuning parameters by hand.
* Persistence: a trained model is data held in memory; it can be saved to the file system and later loaded directly from the file system without retraining.
-
Parallel processing
Parallel processing allows multiple feature-processing steps to run at the same time. Depending on how the feature matrix is read, it can be divided into whole parallel processing and partial parallel processing.
-
Whole parallel processing: the input to every parallel step is the entire feature matrix.
The pipeline module provides the FeatureUnion class to implement whole parallel processing.
```python
from numpy import log1p
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import Binarizer
from sklearn.pipeline import FeatureUnion

# New object that applies a logarithmic transformation to the whole feature matrix
step2_1 = ('ToLog', FunctionTransformer(log1p))
# New object that binarizes the whole feature matrix
step2_2 = ('ToBinary', Binarizer())
# New whole parallel processing object
# The object also has fit and transform methods; both call the fit and transform
# methods of the parallelized objects in parallel.
# The parameter transformer_list is the list of objects to run in parallel,
# given as a list of (name, object) tuples.
step2 = ('FeatureUnion', FeatureUnion(transformer_list=[step2_1, step2_2]))
```
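A minimal usage sketch, assuming the standard iris dataset as input: the FeatureUnion object behaves like a single transformer whose output is the horizontal concatenation of its parallel outputs.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# step2[1] is the FeatureUnion object; its output concatenates the
# log-transformed matrix and the binarized matrix column-wise
combined_features = step2[1].fit_transform(iris.data)
print(combined_features.shape)   # 4 original columns -> 8 output columns
```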
-
Partial parallel processing: each step reads only the columns of the feature matrix defined for it.
We build this by extending pipeline.FeatureUnion:
```python
from sklearn.pipeline import FeatureUnion, _fit_one_transformer, _fit_transform_one, _transform_one
from sklearn.externals.joblib import Parallel, delayed
from scipy import sparse
import numpy as np

class FeatureUnionExt(FeatureUnion):
    def __init__(self, transformer_list, idx_list, n_jobs=1, transformer_weights=None):
        self.idx_list = idx_list
        FeatureUnion.__init__(self, transformer_list=map(lambda trans: (trans[0], trans[1]), transformer_list), n_jobs=n_jobs, transformer_weights=transformer_weights)

    def fit(self, X, y=None):
        transformer_idx_list = map(lambda trans, idx: (trans[0], trans[1], idx), self.transformer_list, self.idx_list)
        transformers = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_one_transformer)(trans, X[:, idx], y)
            for name, trans, idx in transformer_idx_list)
        self._update_transformer_list(transformers)
        return self

    def fit_transform(self, X, y=None, **fit_params):
        transformer_idx_list = map(lambda trans, idx: (trans[0], trans[1], idx), self.transformer_list, self.idx_list)
        result = Parallel(n_jobs=self.n_jobs)(
            delayed(_fit_transform_one)(trans, name, X[:, idx], y, self.transformer_weights, **fit_params)
            for name, trans, idx in transformer_idx_list)
        Xs, transformers = zip(*result)
        self._update_transformer_list(transformers)
        if any(sparse.issparse(f) for f in Xs):
            Xs = sparse.hstack(Xs).tocsr()
        else:
            Xs = np.hstack(Xs)
        return Xs

    def transform(self, X):
        transformer_idx_list = map(lambda trans, idx: (trans[0], trans[1], idx), self.transformer_list, self.idx_list)
        Xs = Parallel(n_jobs=self.n_jobs)(
            delayed(_transform_one)(trans, name, X[:, idx], self.transformer_weights)
            for name, trans, idx in transformer_idx_list)
        if any(sparse.issparse(f) for f in Xs):
            Xs = sparse.hstack(Xs).tocsr()
        else:
            Xs = np.hstack(Xs)
        return Xs
```
In the scenario used in this article, we one-hot encode the first column of the feature matrix (flower color), apply a logarithmic transformation to the second, third, and fourth columns, and binarize the fifth (quantitative) column. The code for partial parallel processing with the FeatureUnionExt class is as follows:
```python
from numpy import log1p
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import Binarizer

# New object that one-hot encodes part of the feature matrix (qualitative feature)
step2_1 = ('OneHotEncoder', OneHotEncoder(sparse=False))
# New object that applies a logarithmic transformation to part of the feature matrix
step2_2 = ('ToLog', FunctionTransformer(log1p))
# New object that binarizes part of the feature matrix
step2_3 = ('ToBinary', Binarizer())
# New partial parallel processing object
# The parameter transformer_list is the list of objects to run in parallel,
# given as a list of (name, object) tuples.
# The parameter idx_list gives the columns of the feature matrix each object reads.
step2 = ('FeatureUnionExt', FeatureUnionExt(transformer_list=[step2_1, step2_2, step2_3], idx_list=[[0], [1, 2, 3], [4]]))
```
-
Pipeline processing
The pipeline module provides the Pipeline class to implement pipeline processing.
Every step in the pipeline except the last must implement the fit_transform method, and each step's output is the next step's input. The last step must implement the fit method, whose input is the output of the previous step; it does not necessarily need a transform method, because the last step of a pipeline may be model training.
```python
from numpy import log1p
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import Binarizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# New object for imputing missing values
step1 = ('Imputer', Imputer())
# New object that one-hot encodes part of the feature matrix (qualitative feature)
step2_1 = ('OneHotEncoder', OneHotEncoder(sparse=False))
# New object that applies a logarithmic transformation to part of the feature matrix
step2_2 = ('ToLog', FunctionTransformer(log1p))
# New object that binarizes part of the feature matrix
step2_3 = ('ToBinary', Binarizer())
# New partial parallel processing object; its return value is the merged output of the parallel steps
step2 = ('FeatureUnionExt', FeatureUnionExt(transformer_list=[step2_1, step2_2, step2_3], idx_list=[[0], [1, 2, 3], [4]]))
# New scaling object (makes the features dimensionless)
step3 = ('MinMaxScaler', MinMaxScaler())
# New chi-square feature selection object
step4 = ('SelectKBest', SelectKBest(chi2, k=3))
# New PCA dimensionality reduction object
step5 = ('PCA', PCA(n_components=2))
# New logistic regression object: the model to be trained as the last step of the pipeline
step6 = ('LogisticRegression', LogisticRegression(penalty='l2'))
# New pipeline object
# The parameter steps is the list of objects to pipeline, given as a list of (name, object) tuples.
pipeline = Pipeline(steps=[step1, step2, step3, step4, step5, step6])
```
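Once assembled, the combined object can be used like any single estimator. A small usage sketch, assuming iris.data and iris.target are the dataset prepared for this scenario (with the qualitative color column in the first position):

```python
# Fit every step of the pipeline in order, ending with the logistic regression model
pipeline.fit(iris.data, iris.target)
# Predict with the whole pipeline: each step's transform feeds the next step
predictions = pipeline.predict(iris.data)
```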
-
Automated parameter tuning
The grid_search module provides tools for automated parameter tuning, including the GridSearchCV class. We use it to train and tune the combined object:
```python
from sklearn.grid_search import GridSearchCV

# New grid search object
# The first parameter is the model to be trained.
# param_grid is the grid of parameters to tune, in dictionary format: keys are parameter
# names in the format "objectname__subobjectname__parametername" (joined with double
# underscores), and values are lists of candidate parameter values.
grid_search = GridSearchCV(pipeline, param_grid={'FeatureUnionExt__ToBinary__threshold': [1.0, 2.0, 3.0, 4.0], 'LogisticRegression__C': [0.1, 0.2, 0.4, 0.8]})
# Train and tune parameters
grid_search.fit(iris.data, iris.target)
```
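After fitting, the tuned settings can be inspected through GridSearchCV's attributes, for example:

```python
# Best parameter combination found on the grid and its cross-validated score
print(grid_search.best_params_)
print(grid_search.best_score_)
```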
-
Persistence
The externals.joblib module provides the dump and load methods to persist in-memory data and load it back:
```python
from sklearn.externals.joblib import dump, load

# Persist data
# The first parameter is the in-memory object
# The second parameter is the file name it is saved under in the file system
# The third parameter is the compression level: 0 is uncompressed, 3 is a reasonable level
dump(grid_search, 'grid_search.dmp', compress=3)
# Load data from the file system back into memory
grid_search = load('grid_search.dmp')
```
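A short follow-up sketch: the reloaded object keeps its trained state, so it can be used directly without retraining (assuming the iris data from the earlier steps):

```python
# The reloaded grid_search contains the refitted best estimator,
# so predictions can be made immediately
predictions = grid_search.predict(iris.data)
```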