Jupyter / Part 1 Chapter 2 / converter

Posted by Melville on Mon, 24 Jan 2022 01:00:11 +0100

converter

Most machine learning algorithms are easier to deal with numbers, so let's turn these texts into numbers first. sklearn provides a converter for such tasks.

Data to convert

# Load Library
import os
import tarfile
import pandas as pd
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
# Load data
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path=os.path.join(housing_path,"housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()
housing.head() # View the first five lines of information
housing_cat = housing[['ocean_proximity']]  
housing_cat.head(10)

Text information is' < 1H ocean ',' Outland ',' ISLAND ',' NEAR BAY ',' NEAR OCEAN ', which shall be converted into digital information. Multiple ideas:

Ordinary encoder converter: directly convert '< 1H ocean', 'Outland', 'ISLAND', 'NEAR BAY', 'NEAR BAY'
'OCEAN' corresponds to numbers 1, 2, 3, 4, 5. Output as a column.
One hot transcoder: '< 1H ocean', 'Outland', 'ISLAND', 'NEAR BAY', 'NEAR BAY'
OCEAN 'is a Boolean quantity divided into 5 columns. For example, the 13769 line 'Insert' corresponds to 0,1,0,0,0.

Ordinary encoder

#Set ocean_ Convert text of proximity to numbers
from sklearn.preprocessing import OrdinalEncoder    #Common Encoder 
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]
#View ocean_ Category of text for proximity
print(ordinal_encoder.categories_)

Disadvantages: the computer will think that two similar numbers are more similar than those far away. However, this is not the case. The similarity between 0 and 4 may be higher than that between 0 and 1

One hot encoder

Return sparse matrix

from sklearn.preprocessing import OneHotEncoder #OneHot encoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot
#View ocean_ Category of text for proximity
cat_encoder.categories_

# By default, when returning, the sparse matrix only stores the position of non-0 value elements, saves memory, and is converted to Numpy array for display
housing_cat_1hot.toarray()

Returns a two-dimensional matrix

from sklearn.preprocessing import OneHotEncoder #OneHot encoder
# Specify sparse=False to return a two-dimensional matrix instead of a sparse matrix
cat_encoder = OneHotEncoder(sparse=False)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot
#View ocean_ Category of text for proximity
cat_encoder.categories_

Custom converter

Use custom converters to add feature rooms_per_household population_per_household

Custom function: remove the medium_ housing_ value； Added rooms_per_household population_per_household two columns

Processing object: partial housing columns

BaseEstimator& TransformerMixin

Hard code (hard coding in English): it means that in software implementation, the output or input related parameters (such as path, output form and format) are directly written in the original code, rather than making appropriate response to the settings, resources, data or format specified by the outside world during execution.

def init: Define a class

The init() method is called by default when creating an object and does not need to be called manually

In init(self), there is one parameter named self by default

init(self,x,y) passes two arguments when creating an object, then__ init__ In (self), self is listed as the first formal parameter, and two formal parameters are required

The self parameter in init(self) does not need to be passed by the developer. The python interpreter will automatically pass in the current object reference

#Converter 1
from sklearn.base import BaseEstimator, TransformerMixin

#Converter 1
from sklearn.base import BaseEstimator, TransformerMixin  

# get the right column indices: safer than hard-coding indices 3, 4, 5, 6
#Get the data to be calculated
rooms_ix, bedrooms_ix, population_ix, household_ix = [
    list(housing.columns).index(col)
    for col in ("total_rooms", "total_bedrooms", "population", "households")]

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):   #Combined attribute adder
    
    def __init__(self, add_bedrooms_per_room = True): # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
        
    def fit(self, X, y=None):
        return self  # nothing else to do
    
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,    #Pair array array1 with array array2 and output
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

FunctionTransformer

#Converter 2
from sklearn.preprocessing import FunctionTransformer

def add_extra_features(X, add_bedrooms_per_room=True):            #Add other characteristic functions add_bedrooms_per_room logo
    rooms_per_household = X[:, rooms_ix] / X[:, household_ix]     #Number of rooms per household
    population_per_household = X[:, population_ix] / X[:, household_ix]    #Number of people per household
    if add_bedrooms_per_room:                                  
        bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]           #Proportion of bedrooms in the room
        return np.c_[X, rooms_per_household, population_per_household,
                     bedrooms_per_room]
    else:
        return np.c_[X, rooms_per_household, population_per_household]

attr_adder = FunctionTransformer(add_extra_features,      #Parameter func: user defined functions and other features added
                        validate=False,          #validate: bool quantity, default=False, input validation off
                        kw_args={"add_bedrooms_per_room": False}) #kw_argsdict, default=None Dictionary of additional keyword parameters to pass to func. 
housing_extra_attribs = attr_adder.fit_transform(housing.values)

housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"],
    index=housing.index)
housing_extra_attribs.head()

Before conversion

After conversion

The converted IDs are not in order
median_ Neither income nor its previous columns have changed

github source code
Thank you guys!

Topics: Machine Learning

Programmer Think