Notes based on the book *Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow*
2.5 Preparing the Data for Machine Learning Algorithms
After the previous analysis, we can start preparing the data for training machine learning algorithms. Instead of doing this manually, we write functions to perform the transformations, which has the following advantages:
- The transformations can easily be reproduced on any dataset (for example, after the dataset is refreshed)
- You gradually build a library of transformation functions that can be reused in future projects
- The functions can be used in a live system to transform new data before feeding it to the algorithm
- You can easily try out different transformations to see which combination works best
Now let's revert to a clean training set (by copying strat_train_set again) and separate the predictors from the labels, since we don't necessarily want to apply the same transformations to both (note that **drop() creates a copy of the data and does not affect strat_train_set**):
```python
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
```
2.5.1 Data Cleaning
Most machine learning algorithms cannot work with missing features, so we need to write a few functions to handle them. We already found that some values of the total_bedrooms attribute are missing, so we need to fix this. There are three options:
- Get rid of the corresponding districts
- Get rid of the whole attribute
- Set the missing values to some value (zero, the mean, the median, etc.)
We choose the third option, namely **imputation; all three operations can easily be done with the DataFrame's dropna(), drop(), and fillna() methods**:
```python
housing.dropna(subset=["total_bedrooms"])     # option 1: drop the districts
housing.drop("total_bedrooms", axis=1)        # option 2: drop the whole attribute
median = housing["total_bedrooms"].median()   # option 3: fill with the median
housing["total_bedrooms"].fillna(median, inplace=True)
```
If you choose option 3, you should compute the median on the training set and use it to fill the missing values there. Remember to save the computed median: you will need it later to replace missing values in the test set, and again to replace missing values in new data once the system goes live.
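As a minimal sketch of this bookkeeping (assuming the held-out strat_test_set from the earlier stratified split):

```python
# Minimal sketch: the median is learned from the training set only, then reused.
median = housing["total_bedrooms"].median()   # computed on the training set
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)

# Later, the *same* median is reused for the test set (and for new data):
strat_test_set["total_bedrooms"] = strat_test_set["total_bedrooms"].fillna(median)
```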
Although option 3 is enough on its own, it is still worth learning Scikit-Learn's data transformation class SimpleImputer, because it will help when we build the data transformation pipeline at the end. Here is how to use it:
1. Create a SimpleImputer instance, specifying that you want to replace each attribute's missing values with that attribute's median:
```python
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
```
2. Since the median can only be computed on numerical attributes, we need to create a copy of the data without the text attribute ocean_proximity:
```python
housing_num = housing.drop("ocean_proximity", axis=1)
```
3. Use the fit() method to fit the imputer instance to the training data:
```python
imputer.fit(housing_num)   # learns the parameters (here, the medians) from the dataset and returns self
```
4. The imputer computes the median of each attribute and stores the results in its statistics_ instance variable. Only total_bedrooms had missing values here, but new data may have missing values in other attributes too, so it is safer to apply the imputer to all numerical attributes:
```python
>>> imputer.statistics_
array([ -118.51, 34.26, 29., 2119.5, 433., 1164., 408., 3.5409])
>>> housing_num.median().values
array([ -118.51, 34.26, 29., 2119.5, 433., 1164., 408., 3.5409])
```
5. Now we can use this "trained" imputer to replace the missing values in the training set with the learned medians:
```python
X = imputer.transform(housing_num)
```
Here X is a plain NumPy array (ndarray) containing the transformed features. It is easy to put it back into a DataFrame:
```python
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
```
2.5.2 Handling Text and Categorical Attributes
We mentioned earlier that the dataset contains a text attribute, ocean_proximity. After exploring it we found that it is really a categorical attribute stored as text, which makes things easier: machine learning algorithms generally prefer to work with numbers, so we will encode the text as numbers. Scikit-Learn offers several encoders, including OrdinalEncoder and OneHotEncoder; here we choose OneHotEncoder, a common way to encode categories.

Its advantage is that the learning algorithm will not assume any relationship between the categories, i.e. no spurious ordering information is added, but this is also its drawback: the encoding is sparse, and when there are many categories it produces a large number of input features, which can slow down training and hurt performance. Sometimes adding some relationship between categories makes it easier for the algorithm to learn the patterns it should learn.

Code:
```python
>>> housing_cat = housing[["ocean_proximity"]]   # 2-D DataFrame, as the encoder expects
>>> from sklearn.preprocessing import OneHotEncoder
>>> cat_encoder = OneHotEncoder()
>>> housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
>>> housing_cat_1hot   # stored as a SciPy sparse matrix to save memory; use toarray() for a dense ndarray
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
        with 16512 stored elements in Compressed Sparse Row format>
>>> housing_cat_1hot.toarray()
array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])
>>> cat_encoder.categories_
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'], dtype=object)]
```
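For comparison, here is a minimal sketch of the OrdinalEncoder alternative mentioned above; it maps each category to a single integer, which implicitly assumes an ordering between the categories:

```python
# Minimal sketch of the OrdinalEncoder alternative (not used in the rest of these notes).
from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
# housing_cat_encoded holds a single number per district (the category index 0..4);
# nearby indices look "similar" to the algorithm, which is usually undesirable here.
```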
Because of the sparsity of one-hot encoding mentioned above, we may prefer to replace the categorical input with a related numerical feature; for example, the ocean_proximity attribute could be replaced by the actual distance to the ocean. A more common approach in modern machine learning is to learn a low-dimensional vector representation, an embedding (if you have heard of word vectors, this will sound familiar). Besides the inputs, a good encoding of the outputs can also affect performance (more on this later).
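To make the embedding idea concrete, here is a minimal sketch using tf.keras (an assumption on my part; this chapter of the book does not use Keras yet):

```python
# Minimal sketch of a learnable embedding for the 5 ocean_proximity categories.
import tensorflow as tf

embedding_layer = tf.keras.layers.Embedding(input_dim=5,   # number of categories
                                            output_dim=2)  # low-dimensional vectors
cat_ids = tf.constant([0, 1, 4])         # category indices, e.g. from OrdinalEncoder
cat_vectors = embedding_layer(cat_ids)   # shape (3, 2); trained jointly with the model
```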
2.5.3 Custom Transformers and Feature Scaling
A transformer is really just a class that generally implements three methods: fit(), transform(), and fit_transform(). For custom transformers we usually add TransformerMixin and BaseEstimator as base classes: the former gives us fit_transform() for free, and the latter (as long as we avoid *args and **kwargs in the constructor) provides get_params() and set_params(), which are useful for hyperparameter tuning.
```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            # np.c_ concatenates the new columns onto X
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
```
This example transformer automatically creates the combined features and appends them to the original data, and its add_bedrooms_per_room hyperparameter makes it easy to switch that extra feature on or off and to try different combinations together with other transformers.
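As a quick, hypothetical usage check of that hyperparameter:

```python
# With add_bedrooms_per_room=True the transformer appends three columns
# (rooms_per_household, population_per_household, bedrooms_per_room) instead of two.
attr_adder_full = CombinedAttributesAdder(add_bedrooms_per_room=True)
housing_all_attribs = attr_adder_full.transform(housing.values)
print(housing.values.shape, "->", housing_all_attribs.shape)
```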
Feature scaling:
Feature scaling should already be familiar. The two common approaches are min-max scaling and standardization (Scikit-Learn provides the MinMaxScaler and StandardScaler classes for these).
As with all transformations, we fit the scaler on the training data only, and we do not apply it to the test set until we have confirmed good results on the cross-validation set and are ready for the final evaluation before going live.
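A minimal sketch of both scalers, assuming we scale the numerical attributes in housing_num and that a hypothetical housing_num_test holds the corresponding test-set columns:

```python
from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler()                  # rescales each feature to [0, 1]
housing_num_minmax = min_max_scaler.fit_transform(housing_num)

std_scaler = StandardScaler()                    # zero mean, unit variance
housing_num_std = std_scaler.fit_transform(housing_num)
# (in the full pipeline below, imputation runs before scaling)

# The scaler is fit on the training data only; the test set is only transformed:
# housing_num_test_std = std_scaler.transform(housing_num_test)
```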
2.5.4 Transformation Pipelines
OK, it is finally time to build the data transformation pipeline. **Building a transformation pipeline not only applies the transformations in the correct order, it also keeps the whole process concise and automated.** Scikit-Learn's Pipeline class lets us set this up quickly.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing_num)
```
The Pipeline constructor takes a list of name/estimator pairs defining the sequence of steps (the names can be anything you like, as long as they are unique and contain no double underscores; they will come in handy for hyperparameter tuning). All steps except the last must be transformers (i.e. they must have a fit_transform() method); the last step can be any estimator.
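A minimal sketch of why the names matter: the pipeline exposes nested hyperparameters under the "name__param" convention, which is what grid search will use later:

```python
# The step name "imputer" gives access to the nested hyperparameter "imputer__strategy".
print("imputer__strategy" in num_pipeline.get_params())   # True
num_pipeline.set_params(imputer__strategy="median")
```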
When you call the pipeline's fit() method, it calls fit_transform() sequentially on all the transformers (which is equivalent to calling fit() and then transform()), passing the output of each call as the input to the next one, until it reaches the final estimator, for which it only calls fit().
The pipeline exposes the same methods as its final estimator. In this example the last estimator is a StandardScaler, which is a transformer, so the pipeline has both a transform() method and a fit_transform() method.
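For example (a hypothetical quick check), since num_pipeline was already fitted above, it can be used as a transformer on new numerical data directly:

```python
# The fitted pipeline acts as a transformer because its last step is a transformer.
some_new_num_data = housing_num.iloc[:5]          # pretend these are new districts
transformed = num_pipeline.transform(some_new_num_data)
print(transformed.shape)
```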
Next, we use Scikit-Learn's ColumnTransformer class to handle the categorical columns and the numerical columns together. It applies the appropriate transformation to each column, and it works particularly well with pandas DataFrames.
The ColumnTransformer constructor takes a list of tuples, where each tuple contains a name, a transformer (or pipeline), and a list of the column names (or indices) that the transformer should be applied to.
```python
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
```
After calling full_pipeline.fit_transform(housing), the ColumnTransformer applies each transformer to the appropriate columns and concatenates the outputs along the second axis (the transformers must return the same number of rows). It is worth noting that OneHotEncoder returns a sparse matrix while num_pipeline returns a dense matrix. When sparse and dense outputs are mixed, the ColumnTransformer estimates the density of the final matrix (i.e. the ratio of non-zero cells) and returns a sparse matrix if the density is below a given threshold (0.3 by default). In this example it returns a dense matrix.
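A minimal sketch of the corresponding hyperparameter, sparse_threshold, reusing the pipeline defined above:

```python
# sparse_threshold controls the sparse/dense decision; 0 means "always return dense".
full_pipeline_dense = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
], sparse_threshold=0)
housing_prepared_dense = full_pipeline_dense.fit_transform(housing)
```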
(Figure: use of other hyperparameters.)