Untitled-0720 records a complete machine learning project

Posted by Centrek on Sat, 15 Jan 2022 14:46:46 +0100

# Compatibility with both Python 2 and Python 3
from __future__ import division, print_function, unicode_literals
import numpy as np
import os

Preface

When carrying out a complete machine learning project, the following steps need to be completed:

  • Understand the big picture of the project
  • Get the data
  • Explore and visualize the data to discover patterns
  • Select suitable machine learning algorithms and prepare the data for them
  • Select and train a model
  • Fine-tune the model
  • Present the solution
  • Deploy, monitor and maintain the system

First, understand the data

  • Model objective: build a California house-price prediction model based on California census data
  • Data content: population, median income, median house price and other indicators for each block group
  • Frame the problem: building a model is not the end goal; the goal is to predict house prices, and the predicted prices will be fed into a downstream investment model
  • Determine the reference performance P: the error of the current manual estimates is about 15%
  • Characterize the problem: supervised learning, a multivariate regression problem, with prediction as the goal
  • Select a performance measure: the typical measure for regression problems is the root mean square error (RMSE):
    $RMSE(X, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left( h\left( x^{(i)} \right) - y^{(i)} \right)^{2}}$
  • If there are many outlier districts, the mean absolute error (MAE) may be preferable:
    $MAE(X, h) = \frac{1}{m}\sum_{i=1}^{m}\left| h\left( x^{(i)} \right) - y^{(i)} \right|$
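Both measures are easy to compute directly; a minimal NumPy sketch on hypothetical predictions h_x and labels y (illustrative values only):

import numpy as np

h_x = np.array([210000.0, 320000.0, 205000.0, 61000.0, 190000.0])  # hypothetical predictions h(x)
y = np.array([206000.0, 318000.0, 211000.0, 59000.0, 188000.0])    # hypothetical true median house values
rmse = np.sqrt(np.mean((h_x - y) ** 2))  # RMSE penalizes large errors more heavily
mae = np.mean(np.abs(h_x - y))           # MAE is more robust to outlier districts
print(rmse, mae)
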
np.random.seed(42)

%matplotlib inline

import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc("axes",labelsize=14)
mpl.rc('xtick',labelsize=12)
mpl.rc("ytick",labelsize=12)


# Directory where figures will be saved
ROOT = '.'
DATE = '0720'
IMAGE = os.path.join(ROOT, 'images', DATE)
os.makedirs(IMAGE, exist_ok=True)

# Helper function for saving figures
def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGE, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

Get data

import pandas as pd
import os
import tarfile

housing = pd.read_csv('./housing.csv')
housing.head()
   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value ocean_proximity
0    -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0        NEAR BAY
1    -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0        NEAR BAY
2    -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0        NEAR BAY
3    -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0        NEAR BAY
4    -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0        NEAR BAY

View the overall structure of the data

housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
  • 0-1: longitude & latitude
  • 2: housing median age
  • 3: total rooms in the district
  • 4: total bedrooms
  • 5: population
  • 6: households
  • 7: median income
  • 8: median house value
  • 9: ocean proximity

Note that total_bedrooms has only 20433 non-null values, so 207 districts are missing this attribute.

Examining the categorical attribute

housing['ocean_proximity'].value_counts()
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
  • <1H OCEAN: within an hour of the ocean
  • INLAND: inland
  • NEAR OCEAN: near the ocean
  • NEAR BAY: near the bay
  • ISLAND: on an island

Descriptive statistics

housing.describe()
          longitude      latitude  housing_median_age   total_rooms  total_bedrooms    population    households  median_income  median_house_value
count  20640.000000  20640.000000        20640.000000  20640.000000    20433.000000  20640.000000  20640.000000   20640.000000        20640.000000
mean    -119.569704     35.631861           28.639486   2635.763081      537.870553   1425.476744    499.539680       3.870671       206855.816909
std        2.003532      2.135952           12.585558   2181.615252      421.385070   1132.462122    382.329753       1.899822       115395.615874
min     -124.350000     32.540000            1.000000      2.000000        1.000000      3.000000      1.000000       0.499900        14999.000000
25%     -121.800000     33.930000           18.000000   1447.750000      296.000000    787.000000    280.000000       2.563400       119600.000000
50%     -118.490000     34.260000           29.000000   2127.000000      435.000000   1166.000000    409.000000       3.534800       179700.000000
75%     -118.010000     37.710000           37.000000   3148.000000      647.000000   1725.000000    605.000000       4.743250       264725.000000
max     -114.310000     41.950000           52.000000  39320.000000     6445.000000  35682.000000   6082.000000      15.000100       500001.000000
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
save_fig("attribute_histogram_plots")
plt.show()
Saving figure attribute_histogram_plots

np.random.seed(42)

Splitting the data into training and test sets

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
# test_size=0.2: the test set is 20% of the data
train_set.head(5)
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value ocean_proximity
14196    -117.03     32.71                33.0       3126.0           627.0      2300.0       623.0         3.2596            103000.0      NEAR OCEAN
8267     -118.16     33.77                49.0       3382.0           787.0      1314.0       756.0         3.8125            382100.0      NEAR OCEAN
17445    -120.48     34.66                 4.0       1897.0           331.0       915.0       336.0         4.1563            172600.0      NEAR OCEAN
14265    -117.11     32.69                36.0       1421.0           367.0      1418.0       355.0         1.9425             93400.0      NEAR OCEAN
2271     -119.80     36.78                43.0       2382.0           431.0       874.0       380.0         3.5542             96500.0          INLAND
test_set.head()
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value ocean_proximity
20046    -119.01     36.06                25.0       1505.0             NaN      1392.0       359.0         1.6812             47700.0          INLAND
3024     -119.46     35.14                30.0       2943.0             NaN      1565.0       584.0         2.5313             45800.0          INLAND
15663    -122.44     37.80                52.0       3830.0             NaN      1310.0       963.0         3.4801            500001.0        NEAR BAY
20484    -118.72     34.28                17.0       3051.0             NaN      1705.0       495.0         5.7376            218600.0       <1H OCEAN
9814     -121.93     36.62                34.0       2351.0             NaN      1063.0       428.0         3.7250            278000.0      NEAR OCEAN
housing["median_income"].hist()
<AxesSubplot:>

housing['income_cat']= pd.cut(housing['median_income'],bins=[0., 1.5, 3.0, 4.5, 6., np.inf],labels=[1, 2, 3, 4, 5])

housing['income_cat'].value_counts()
3    7236
2    6581
4    3639
5    2362
1     822
Name: income_cat, dtype: int64
housing['income_cat'].hist()
<AxesSubplot:>

https://blog.csdn.net/Cicome/article/details/79153268

Stratified sampling using StratifiedShuffleSplit

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
# Check the sizes of the stratified split
print(len(train_index),len(test_index))
16512 4128


strat_test_set["income_cat"].value_counts() / len(strat_test_set)
3    0.350533
2    0.318798
4    0.176357
5    0.114583
1    0.039729
Name: income_cat, dtype: float64
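To see why stratification matters, you can compare the income-category proportions of the full dataset, a purely random split, and the stratified split. A small sketch (the random split is redone here only for the comparison, since the earlier test_set was created before income_cat existed):

def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)

# random split for comparison only
_, random_test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Random test set": income_cat_proportions(random_test_set),
    "Stratified test set": income_cat_proportions(strat_test_set),
}).sort_index()
print(compare_props)
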
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

Data exploration

housing = strat_train_set.copy()
housing.plot(kind="scatter", x="longitude", y="latitude",alpha=0.2)
save_fig("bad_visualization_plot")
Saving figure bad_visualization_plot

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
    sharex=False)
plt.legend()
save_fig("housing_prices_scatterplot")
Saving figure housing_prices_scatterplot

import os
import tarfile
import urllib.request

PROJECT_ROOT_DIR = "."

images_path = os.path.join(PROJECT_ROOT_DIR, "images", "end_to_end_project")
os.makedirs(images_path, exist_ok=True)
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
filename = "california.png"
print("Downloading", filename)
url = DOWNLOAD_ROOT + "images/end_to_end_project/" + filename
urllib.request.urlretrieve(url, os.path.join(images_path, filename))
Downloading california.png





('.\\images\\end_to_end_project\\california.png',
 <http.client.HTTPMessage at 0x22a301f6648>)
import matplotlib.image as mpimg

california_img=mpimg.imread(PROJECT_ROOT_DIR + '/images/end_to_end_project/california.png')
ax = housing.plot(kind="scatter", x="longitude", y="latitude", figsize=(10,7),
                       s=housing['population']/100, label="Population",
                       c="median_house_value", cmap=plt.get_cmap("jet"),
                       colorbar=False, alpha=0.4,
                      )
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5,
           cmap=plt.get_cmap("jet"))
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)


prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
cbar = plt.colorbar()
cbar.ax.set_yticklabels(["$%dk"%(round(v/1000)) for v in tick_values], fontsize=14)
cbar.set_label('Median House Value', fontsize=16)

plt.legend(fontsize=16)
save_fig("california_housing_prices_plot")
plt.show()
Saving figure california_housing_prices_plot

Correlation coefficients

corr_matrix = housing.corr()
#Correlation between each attribute and median house price
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.688075
total_rooms           0.134153
housing_median_age    0.105623
households            0.065843
total_bedrooms        0.049686
population           -0.024650
longitude            -0.045967
latitude             -0.144160
Name: median_house_value, dtype: float64
from pandas.plotting import scatter_matrix


attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
save_fig("scatter_matrix_plot")

Saving figure scatter_matrix_plot

housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)
plt.axis([0, 16, 0, 550000])
save_fig("income_vs_house_value_scatterplot")
Saving figure income_vs_house_value_scatterplot

Attribute combination test

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value          1.000000
median_income               0.688075
rooms_per_household         0.151948
total_rooms                 0.134153
housing_median_age          0.105623
households                  0.065843
total_bedrooms              0.049686
population_per_household   -0.023737
population                 -0.024650
longitude                  -0.045967
latitude                   -0.144160
bedrooms_per_room          -0.255880
Name: median_house_value, dtype: float64

The attributes most correlated with median_house_value:

  • median_income 0.687160
  • bedrooms_per_room -0.259984
  • rooms_per_household 0.146285
  • total_rooms 0.135097
  • housing_median_age 0.114110

The new bedrooms_per_room attribute is much more strongly correlated with the median house value than total_rooms or total_bedrooms.

Prepare the data for Machine Learning algorithms

  • Writing transformation functions makes it easy to reproduce the transformations on any dataset (for example, the next time you get a fresh dataset).
  • You gradually build a library of transformation functions that can be reused in future projects.
housing = strat_train_set.drop("median_house_value", axis=1)  # drop() returns a copy; strat_train_set itself is not modified
housing_labels = strat_train_set["median_house_value"].copy()

Data cleaning

For missing values, there are three options:

  • Remove the corresponding districts;
  • Remove the entire attribute;
  • Fill in the missing values (with 0, the mean, the median, etc.).
housing.dropna(subset=["total_bedrooms"])    # Option 1: drop districts with missing values
housing.drop("total_bedrooms", axis=1)       # Option 2: drop the whole attribute
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median)     # Option 3: fill with the median (all three return copies)

If you choose option 3, you need to compute the median on the training set, use it to fill the training set's missing values, and don't forget to save that median: you will need it later to replace missing values in the test set when evaluating the system, and also to handle missing values in new data once the system is live.

Scikit-Learn provides a convenient class to handle missing values: SimpleImputer (called Imputer before version 0.20).

# Pre-0.20 API; the version-robust equivalent using SimpleImputer is shown below
from sklearn.preprocessing import Imputer
imputer = Imputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)  # the median can only be computed on numerical attributes
imputer.fit(housing_num)
imputer.statistics_
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns)

sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head()
#isnull().any(axis=1) selects the rows that contain at least one missing value
sample_incomplete_rows
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income ocean_proximity
4629     -118.30     34.07                18.0       3759.0             NaN      3296.0      1462.0         2.2708       <1H OCEAN
6068     -117.86     34.01                16.0       4632.0             NaN      3038.0       727.0         5.1762       <1H OCEAN
17923    -121.97     37.35                30.0       1955.0             NaN       999.0       386.0         4.6328       <1H OCEAN
13656    -117.30     34.05                 6.0       2155.0             NaN      1039.0       391.0         1.6675          INLAND
19252    -122.79     38.48                 7.0       6837.0             NaN      3468.0      1405.0         3.1662       <1H OCEAN

As seen above, total_bedrooms has missing values, so we replace them with the median.

The imputer simply computes the median of every numerical attribute in the training set and stores the results in its statistics_ attribute.

A try/except block handles the Scikit-Learn API change (Imputer was moved and renamed SimpleImputer in version 0.20):

try:
    from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+
except ImportError:
    from sklearn.preprocessing import Imputer as SimpleImputer
    
imputer = SimpleImputer(strategy="median")

housing_num = housing.drop('ocean_proximity', axis=1)  # drop the categorical attribute: the median only makes sense for numerical attributes
imputer.fit(housing_num)
SimpleImputer(strategy='median')
# The medians learned for each numerical attribute
imputer.statistics_
array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409])
X = imputer.transform(housing_num)#Replace missing values with median
housing_num
       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income
17606    -121.89     37.29                38.0       1568.0           351.0       710.0       339.0         2.7042
18632    -121.93     37.05                14.0        679.0           108.0       306.0       113.0         6.4214
14650    -117.20     32.77                31.0       1952.0           471.0       936.0       462.0         2.8621
3230     -119.61     36.31                25.0       1847.0           371.0      1460.0       353.0         1.8839
3555     -118.59     34.23                17.0       6592.0          1525.0      4459.0      1463.0         3.0347
...          ...       ...                 ...          ...             ...         ...         ...            ...
6563     -118.13     34.20                46.0       1271.0           236.0       573.0       210.0         4.9312
12053    -117.56     33.88                40.0       1196.0           294.0      1052.0       258.0         2.0682
13908    -116.40     34.09                 9.0       4855.0           872.0      2098.0       765.0         3.2723
11159    -118.01     33.82                31.0       1960.0           380.0      1356.0       356.0         4.0625
15775    -122.45     37.77                52.0       3095.0           682.0      1269.0       639.0         3.5750

[16512 rows x 8 columns]

housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing.index)

Handling the text/categorical attribute

housing_cat = housing[['ocean_proximity']]
housing_cat.head(10)
      ocean_proximity
17606       <1H OCEAN
18632       <1H OCEAN
14650      NEAR OCEAN
3230           INLAND
3555        <1H OCEAN
19480          INLAND
8879        <1H OCEAN
13685          INLAND
4937        <1H OCEAN
4861        <1H OCEAN
try:
    from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20
    from sklearn.preprocessing import OneHotEncoder
except ImportError:
    from future_encoders import OneHotEncoder 

cat_encoder = OneHotEncoder(sparse=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot
array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])
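To check which column corresponds to which category, you can inspect the categories the encoder has learned (a short sketch):

print(cat_encoder.categories_)
# expected: one array containing the five categories in alphabetical order:
# ['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN']
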
housing.columns
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'ocean_proximity'],
      dtype='object')

Create your own transformer class

Custom transformer:

Create a class that implements fit() (returning self), transform(), and fit_transform().

You get fit_transform() for free simply by adding TransformerMixin as a base class (parent class).

In addition, if you add BaseEstimator as a base class (and avoid *args and **kwargs in the constructor), you also get two extra methods, get_params() and set_params(), which make the transformer usable in automatic hyperparameter tuning.

The sklearn project can be regarded as a big tree: the various estimators are the fruit, and the backbone supporting them is a small number of base classes, such as BaseEstimator, BaseSGD, ClassifierMixin, RegressorMixin, and so on. [source] https://www.cnblogs.com/learn-the-hard-way/p/12532888.html

np.c_ concatenates several arrays column-wise.

from sklearn.base import BaseEstimator, TransformerMixin

# get the right column indices: safer than hard-coding indices 3, 4, 5, 6
rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        # np.c_ concatenates the arrays column-wise
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

Another method: FunctionTransformer

from sklearn.preprocessing import FunctionTransformer

def add_extra_features(X, add_bedrooms_per_room=True):
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
    population_per_household = X[:, population_ix] / X[:, household_ix]
    if add_bedrooms_per_room:
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household,
                     bedrooms_per_room]
    else:
        return np.c_[X, rooms_per_household, population_per_household]

attr_adder = FunctionTransformer(add_extra_features, validate=False,
                                 kw_args={"add_bedrooms_per_room": False})
housing_extra_attribs = attr_adder.fit_transform(housing.values)
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"],
    index=housing.index)
housing_extra_attribs.columns
Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'ocean_proximity', 'rooms_per_household', 'population_per_household'],
      dtype='object')
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),  # 1. fill missing values with the median
        ('attribs_adder', FunctionTransformer(add_extra_features, validate=False)),  # 2. add the combined attributes
        ('std_scaler', StandardScaler()),  # 3. feature scaling
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

Feature scaling

Feature scaling is an important step in machine learning: most algorithms perform poorly when the numerical attributes have very different scales (here total_rooms ranges from 2 to 39,320 while median_income only ranges from about 0.5 to 15). Two common approaches are min-max scaling (normalization) and standardization; the pipeline above uses StandardScaler.
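For illustration only, a tiny sketch comparing the two approaches on a hypothetical array of total_rooms values (not part of the pipeline above):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

rooms = np.array([[2.0], [2127.0], [39320.0]])         # hypothetical total_rooms values (min, median, max)
print(MinMaxScaler().fit_transform(rooms).ravel())     # min-max scaling: squashes values into the 0-1 range
print(StandardScaler().fit_transform(rooms).ravel())   # standardization: zero mean, unit variance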

Conversion pipeline

Use a Pipeline to execute the transformation steps in the correct order.

try:
    from sklearn.compose import ColumnTransformer
except ImportError:
    from future_encoders import ColumnTransformer # Scikit-Learn < 0.20
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])
housing_prepared.shape
(16512, 16)
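The 16 columns are the 8 numerical attributes, the 3 combined attributes added by add_extra_features, and the 5 one-hot categories. A short sketch to reconstruct the column names from the fitted pipeline (the same mapping is reused below for the feature importances):

num_feature_names = list(housing_num.columns) + ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
cat_feature_names = list(full_pipeline.named_transformers_["cat"].categories_[0])
print(num_feature_names + cat_feature_names)  # 8 + 3 + 5 = 16 names, matching housing_prepared.shape[1]
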
from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select numerical or categorical columns 
class OldDataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

old_num_pipeline = Pipeline([
        ('selector', OldDataFrameSelector(num_attribs)),
        ('imputer', SimpleImputer(strategy="median")),
        ('attribs_adder', FunctionTransformer(add_extra_features, validate=False)),
        ('std_scaler', StandardScaler()),
    ])

old_cat_pipeline = Pipeline([
        ('selector', OldDataFrameSelector(cat_attribs)),
        ('cat_encoder', OneHotEncoder(sparse=False)),
    ])
from sklearn.pipeline import FeatureUnion

old_full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", old_num_pipeline),
        ("cat_pipeline", old_cat_pipeline),
    ])
old_housing_prepared = old_full_pipeline.fit_transform(housing)
old_housing_prepared
array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])
np.allclose(housing_prepared, old_housing_prepared)
True

Select and train models

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
LinearRegression()
some_data=housing.iloc[:5]
some_labels=housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", lin_reg.predict(some_data_prepared))
Predictions: [210644.60459286 317768.80697211 210956.43331178  59218.98886849
 189747.55849879]
from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
68628.19819848923

Most districts' median housing values fall between about $120,000 and $265,000, so a typical prediction error of $68,628 is clearly too large: the model is underfitting the training data.

The main ways to fix underfitting are to select a more powerful model, to feed the training algorithm better features, or to reduce the constraints on the model. This model is not regularized, which rules out the last option. You could try adding more features (for example, the log of the population), but first let's try a more complex model and see how it performs.
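As an aside, such a feature is a one-liner to try; a quick sketch (illustration only, the feature is not used in the rest of this post):

log_population = np.log(housing["population"])            # hypothetical extra feature
print(np.corrcoef(log_population, housing_labels)[0, 1])  # its linear correlation with the target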

Try the decision tree

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)
DecisionTreeRegressor(random_state=42)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
0.0

RMSE is 0!

That is a serious warning sign of overfitting.

Remember: we must not touch the test set at all until we are ready to select a final model, so the training data itself has to be evaluated more carefully, for example with cross-validation.

Cross validation

from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

Important note: the scoring used for cross-validation is a utility function (greater is better), not a cost function (lower is better), so neg_mean_squared_error returns negative values. That is why we compute -scores before taking the square root.

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

display_scores(tree_rmse_scores)
Scores: [70194.33680785 66855.16363941 72432.58244769 70758.73896782
 71115.88230639 75585.14172901 70262.86139133 70273.6325285
 75366.87952553 71231.65726027]
Mean: 71407.68766037929
Standard deviation: 2439.4345041191004

It looks worse than the linear regression model! Note that cross validation not only gives you an assessment of the performance of the model, but also measures the accuracy of the assessment (i.e., its standard deviation).

The score of the decision tree is about 71407, which usually fluctuates by ± 2439.

lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
Scores: [66782.73843989 66960.118071   70347.95244419 74739.57052552
 68031.13388938 71193.84183426 64969.63056405 68281.61137997
 71552.91566558 67665.10082067]
Mean: 69052.46136345083
Standard deviation: 2731.674001798349

The score of linear regression is about 69052, which usually fluctuates by ± 2731.

The decision tree model is overfitting so badly that it performs worse than the linear regression model.

from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
forest_reg.fit(housing_prepared, housing_labels)
RandomForestRegressor(n_estimators=10, random_state=42)
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
21933.31414779769
from sklearn.model_selection import cross_val_score

forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
Scores: [51646.44545909 48940.60114882 53050.86323649 54408.98730149
 50922.14870785 56482.50703987 51864.52025526 49760.85037653
 55434.21627933 53326.10093303]
Mean: 52583.72407377466
Standard deviation: 2298.353351147122

The score of random forest is about 52583, which usually fluctuates by ± 2298.

The scores here refer to the validation folds.

The training-set RMSE (about 21,933) is far lower than the cross-validation RMSE (about 52,583), which means the model is still overfitting the training set.

Overfitting can be addressed by simplifying the model, constraining it (i.e. regularization), or using more training data.

Before going much deeper into random forests, you should try out other kinds of machine learning algorithms (support vector machines with different kernels, neural networks, etc.), without spending too much time tweaking the hyperparameters. The goal is to shortlist a few (two to five) promising models.

Trying different models takes precedence over tuning hyperparameters.

You should save every model you experiment with so you can come back to it later. Make sure to save both the hyperparameters and the trained parameters, as well as the cross-validation scores and the actual predictions. This lets you compare scores across model types and compare the kinds of errors they make. You can save Scikit-Learn models easily with Python's pickle module, or with joblib, which is more efficient at serializing large NumPy arrays:

try:
    import joblib  # standalone joblib package (newer Scikit-Learn versions)
except ImportError:
    from sklearn.externals import joblib  # older Scikit-Learn versions
joblib.dump(my_model, "my_model.pkl")  # my_model: whichever fitted model you want to keep
# then, later
my_model_loaded = joblib.load("my_model.pkl")

Model tuning

Suppose you now have a shortlist of promising models. The next step is to fine-tune them. Let's look at a few ways to do that.

Grid search

One way to fine-tune is to adjust the hyperparameters manually until you find a good combination. That would be extremely tedious, and you may not have time to explore many combinations.

Instead, you should get Scikit-Learn's GridSearchCV to do the searching for you. All you need to do is tell it which hyperparameters you want to experiment with and which values to try; it will then use cross-validation to evaluate all possible combinations of hyperparameter values. For example, the following code searches for the best combination of hyperparameter values for RandomForestRegressor:

from sklearn.model_selection import GridSearchCV

param_grid = [
    # try 12 (3×4) combinations of hyperparameters
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    # then try 6 (2×3) combinations with bootstrap set as False
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor(random_state=42)
# train across 5 folds, that's a total of (12+6)*5=90 rounds of training 
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error', return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
GridSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
             param_grid=[{'max_features': [2, 4, 6, 8],
                          'n_estimators': [3, 10, 30]},
                         {'bootstrap': [False], 'max_features': [2, 3, 4],
                          'n_estimators': [3, 10]}],
             return_train_score=True, scoring='neg_mean_squared_error')
grid_search.best_params_
{'max_features': 8, 'n_estimators': 30}
grid_search.best_estimator_
RandomForestRegressor(max_features=8, n_estimators=30, random_state=42)
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
63669.11631261028 {'max_features': 2, 'n_estimators': 3}
55627.099719926795 {'max_features': 2, 'n_estimators': 10}
53384.57275149205 {'max_features': 2, 'n_estimators': 30}
60965.950449450494 {'max_features': 4, 'n_estimators': 3}
52741.04704299915 {'max_features': 4, 'n_estimators': 10}
50377.40461678399 {'max_features': 4, 'n_estimators': 30}
58663.93866579625 {'max_features': 6, 'n_estimators': 3}
52006.19873526564 {'max_features': 6, 'n_estimators': 10}
50146.51167415009 {'max_features': 6, 'n_estimators': 30}
57869.25276169646 {'max_features': 8, 'n_estimators': 3}
51711.127883959234 {'max_features': 8, 'n_estimators': 10}
49682.273345071546 {'max_features': 8, 'n_estimators': 30}
62895.06951262424 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54658.176157539405 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59470.40652318466 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
52724.9822587892 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57490.5691951261 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51009.495668875716 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}
pd.DataFrame(grid_search.cv_results_)
(Output: the cross-validation results as a DataFrame of 18 rows × 23 columns, including fit/score times, each hyperparameter combination, the per-fold test scores, the mean and standard deviation of the test and train scores, and the rank of each combination.)

We have successfully fine-tuned the model and found the best combination of hyperparameters.

Tip: don't forget that you can treat data-preparation steps as hyperparameters. For example, grid search can automatically determine whether or not to add a feature you are unsure about (such as the add_bedrooms_per_room hyperparameter of the CombinedAttributesAdder transformer). It can similarly be used to automatically find the best way to handle outliers, missing features, feature selection, and more; see the sketch below.
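For example, here is a minimal sketch (reusing the full_pipeline and RandomForestRegressor defined above; the pipeline and grid names are hypothetical) that wraps data preparation and the model into a single Pipeline, so that grid search can decide whether to add the bedrooms_per_room feature:

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# one pipeline from raw data to predictions, so preparation options become searchable hyperparameters
full_pipeline_with_predictor = Pipeline([
    ("preparation", full_pipeline),
    ("forest", RandomForestRegressor(random_state=42)),
])

prep_param_grid = [{
    # toggle the bedrooms_per_room feature inside the numerical pipeline's FunctionTransformer
    "preparation__num__attribs_adder__kw_args": [
        {"add_bedrooms_per_room": False},
        {"add_bedrooms_per_room": True},
    ],
    "forest__n_estimators": [10, 30],
}]

grid_search_prep = GridSearchCV(full_pipeline_with_predictor, prep_param_grid, cv=5,
                                scoring="neg_mean_squared_error")
grid_search_prep.fit(housing, housing_labels)
print(grid_search_prep.best_params_)
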

Random search

When the hyperparameter search space is large, it is usually preferable to use RandomizedSearchCV instead.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
        'n_estimators': randint(low=1, high=200),
        'max_features': randint(low=1, high=8),
    }

forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(random_state=42),
                   param_distributions={'max_features': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001FAB8188D88>,
                                        'n_estimators': <scipy.stats._distn_infrastructure.rv_frozen object at 0x000001FAB8188C08>},
                   random_state=42, scoring='neg_mean_squared_error')
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
49150.70756927707 {'max_features': 7, 'n_estimators': 180}
51389.889203389284 {'max_features': 5, 'n_estimators': 15}
50796.155224308866 {'max_features': 3, 'n_estimators': 72}
50835.13360315349 {'max_features': 5, 'n_estimators': 21}
49280.9449827171 {'max_features': 7, 'n_estimators': 122}
50774.90662363929 {'max_features': 3, 'n_estimators': 75}
50682.78888164288 {'max_features': 3, 'n_estimators': 88}
49608.99608105296 {'max_features': 5, 'n_estimators': 100}
50473.61930350219 {'max_features': 3, 'n_estimators': 150}
64429.84143294435 {'max_features': 5, 'n_estimators': 2}

The relative importance of each attribute (feature) for making accurate predictions

feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
array([7.33442355e-02, 6.29090705e-02, 4.11437985e-02, 1.46726854e-02,
       1.41064835e-02, 1.48742809e-02, 1.42575993e-02, 3.66158981e-01,
       5.64191792e-02, 1.08792957e-01, 5.33510773e-02, 1.03114883e-02,
       1.64780994e-01, 6.02803867e-05, 1.96041560e-03, 2.85647464e-03])
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
#cat_encoder = cat_pipeline.named_steps["cat_encoder"] # old solution
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
[(0.36615898061813423, 'median_income'),
 (0.16478099356159054, 'INLAND'),
 (0.10879295677551575, 'pop_per_hhold'),
 (0.07334423551601243, 'longitude'),
 (0.06290907048262032, 'latitude'),
 (0.056419179181954014, 'rooms_per_hhold'),
 (0.053351077347675815, 'bedrooms_per_room'),
 (0.04114379847872964, 'housing_median_age'),
 (0.014874280890402769, 'population'),
 (0.014672685420543239, 'total_rooms'),
 (0.014257599323407808, 'households'),
 (0.014106483453584104, 'total_bedrooms'),
 (0.010311488326303788, '<1H OCEAN'),
 (0.0028564746373201584, 'NEAR OCEAN'),
 (0.0019604155994780706, 'NEAR BAY'),
 (6.0280386727366e-05, 'ISLAND')]
final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

Calculate the confidence interval of RMSE

from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
mean = squared_errors.mean()
m = len(squared_errors)

np.sqrt(stats.t.interval(confidence, m - 1,
                         loc=np.mean(squared_errors),
                         scale=stats.sem(squared_errors)))

array([45685.10470776, 49691.25001878])
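The same interval can be obtained by computing the t-score manually (a sketch reusing the variables defined above):

tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)        # two-sided t critical value
tmargin = tscore * squared_errors.std(ddof=1) / np.sqrt(m)  # margin of error on the mean squared error
print(np.sqrt(mean - tmargin), np.sqrt(mean + tmargin))     # should match the interval above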

Topics: Machine Learning Project