Yellowbrick Tutorial: A Data Exploration and Visualization Library for Python Machine Learning

Posted by koddos on Tue, 20 Aug 2019 11:27:31 +0200

Background

When learning sklearn, besides the algorithms themselves, you also have to learn matplotlib for visualization. In my practical work, visualization matters even more, but matplotlib's ease of use and aesthetics leave much to be desired. I tried plotly and seaborn one after another and finally settled on Bokeh, because it integrates seamlessly with Flask, which makes building data dashboards far less difficult.

Recently I came across this library, which makes data exploration more convenient, so today I set aside some time to learn it. I started from the English documentation, then discovered that someone has been translating it into Chinese. The translation reads like Google Translate, but in the spirit of borrowing what works and saving effort, I copied and worked through about half of it, and still found a few places where my results disagree with the documentation.

# http://www.scikit-yb.org/zh/latest/tutorial.html

Model Selection Tutorial

In this tutorial, we will look at the scores of various Scikit-Learn models and compare them using Yellowbrick's visual diagnostic tools, in order to select the best model for our data.

Model Selection Triple

Discussions of machine learning often focus on model selection. Whether it is logistic regression, random forests, Bayesian methods, or artificial neural networks, machine learning practitioners are usually quick to voice their preferences. This is mostly for historical reasons: although modern third-party machine learning libraries have made deploying any of these models trivial, traditionally even one of these algorithms took years of study to apply and optimize. As a result, practitioners tend to have strong preferences for particular (and more likely familiar) models over others.

However, model selection is more nuanced than simply picking the "right" or "wrong" algorithm. The workflow in practice includes:

1. Select and/or engineer the smallest and most predictive feature set
2. Choose a set of algorithms from a model family
3. Tune the algorithm hyperparameters to optimize performance

The model selection triple was first described in a 2015 SIGMOD paper by Kumar et al. Their paper concerns the development of next-generation database systems for predictive modeling, and the authors argue convincingly that such systems are urgently needed because machine learning in practice is highly experimental. "Model selection," they explain, "is iterative and exploratory because the space of (model selection triples) is usually infinite, and it is generally impossible for analysts to know a priori which (combination) will yield satisfactory accuracy and/or insight."

Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can home in on model quality more effectively than exhaustive search. By visualizing the model selection process, data scientists can steer toward final, interpretable models and avoid pitfalls.
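For reference, a grid search in scikit-learn looks roughly like this (a minimal sketch; the parameter grid is illustrative, and X_encoded stands for an already numerically encoded feature matrix, which we build later in this tutorial):

# Minimal grid-search sketch with scikit-learn's GridSearchCV.
# The parameter values are illustrative, not tuned recommendations.
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), param_grid, cv=5)
# search.fit(X_encoded, y)   # X_encoded: numeric features (see below)
# print(search.best_params_, search.best_score_)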

The Yellowbrick library is a visual diagnostic platform for machine learning that lets data scientists steer the model selection process. Yellowbrick extends the Scikit-Learn API with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of a Scikit-Learn pipeline, providing visual diagnostics throughout the transformation of high-dimensional data.
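A Visualizer follows the familiar scikit-learn estimator interface. As a minimal sketch of the pattern, using Rank2D from yellowbrick.features (X_num is assumed to be an already-encoded numeric feature matrix, since Pearson ranking needs numbers):

# The general Visualizer pattern: construct, fit, transform, poof
from yellowbrick.features import Rank2D

visualizer = Rank2D(algorithm='pearson')  # pairwise Pearson correlations
visualizer.fit(X_num, y)                  # fit like any sklearn estimator
visualizer.transform(X_num)               # compute and draw the ranking
visualizer.poof()                         # finalize and render the figure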

About the Data

This tutorial uses a modified version of the mushroom dataset from the UCI Machine Learning Repository. Our objective is to predict whether a mushroom is poisonous or edible based on its characteristics.

The data include descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota families. Each species was identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended (this latter class was combined with the poisonous one).

Our file, "agaricus-lepiota.txt", contains information on 3 nominally valued attributes and a target value for 8124 mushroom instances (4208 edible, 3916 poisonous).

Let's load the data with Pandas.

import os
import pandas as pd
mushrooms = 'data/shrooms.csv'  # data set
dataset   = pd.read_csv(mushrooms)
# dataset.columns = names  # not needed here: the CSV already has a header row
dataset.head()

id class cap-shape cap-surface cap-color bruises odor gill-attachment gill-spacing gill-size ... stalk-color-above-ring stalk-color-below-ring veil-type veil-color ring-number ring-type spore-print-color population habitat Unnamed: 24
0 1 p x s n t p f c n ... w w p w o p k s u NaN
1 2 e x s y t a f c b ... w w p w o p n n g NaN
2 3 e b s w t l f c b ... w w p w o p n n m NaN
3 4 p x y w t p f c n ... w w p w o p k s u NaN
4 5 e x s g f n f w b ... w w p w o e n a g NaN

5 rows × 25 columns

features = ['cap-shape', 'cap-surface', 'cap-color']
target   = ['class']
X = dataset[features]
y = dataset[target]
dataset.shape  # Two fewer mushrooms than in the official document
(8122, 25)



dataset.groupby('class').count()  # One fewer mushroom in each class

id cap-shape cap-surface cap-color bruises odor gill-attachment gill-spacing gill-size gill-color ... stalk-color-above-ring stalk-color-below-ring veil-type veil-color ring-number ring-type spore-print-color population habitat Unnamed: 24
class
e 4207 4207 4207 4207 4207 4207 4207 4207 4207 4207 ... 4207 4207 4207 4207 4207 4207 4207 4207 4207 0
p 3915 3915 3915 3915 3915 3915 3915 3915 3915 3915 ... 3915 3915 3915 3915 3915 3915 3915 3915 3915 0

2 rows × 24 columns

Feature Extraction

Our data, including the target, are categorical. We will need to convert these values to numeric ones for machine learning. To extract this from the dataset, we have to use Scikit-Learn transformers to turn the input dataset into something that can be fit to a model. Fortunately, Scikit-Learn provides a transformer for converting categorical labels into integers: sklearn.preprocessing.LabelEncoder. Unfortunately, it can only transform a single vector at a time, so we will have to adapt it to apply it to multiple columns.
(In other words, each categorical column of our mushroom data is exactly one such vector.)
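To see the single-vector limitation concretely (an illustrative one-liner):

# LabelEncoder handles one column at a time, not a whole DataFrame
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit_transform(dataset['cap-shape'])   # works: a single 1-d vector
# le.fit_transform(dataset[features])    # fails: expects a 1-d array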

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

class EncodeCategorical(BaseEstimator, TransformerMixin):
    """
    Encodes a specified list of columns, or all columns if None.
    """

    def __init__(self, columns=None):
        # Keep None as None so fit() can fall back to all columns;
        # the original list comprehension crashed when columns was None.
        self.columns  = list(columns) if columns is not None else None
        self.encoders = None

    def fit(self, data, target=None):
        """
        Expects a data frame with named columns to encode.
        """
        # Encode all columns if columns is None
        if self.columns is None:
            self.columns = data.columns

        # Fit a label encoder for each column in the data frame
        self.encoders = {
            column: LabelEncoder().fit(data[column])
            for column in self.columns
        }
        return self

    def transform(self, data):
        """
        Uses the encoders to transform a data frame.
        """
        output = data.copy()
        for column, encoder in self.encoders.items():
            output[column] = encoder.transform(data[column])

        return output
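As a quick sanity check (illustrative), the transformer can be applied on its own before we wire it into a pipeline:

# Encode the three feature columns and inspect the result
encoder = EncodeCategorical(X.columns)
X_encoded = encoder.fit_transform(X)
X_encoded.head()  # each letter code is now an integer label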

Modeling and Evaluation

Common metrics for evaluating classifiers

Precision is the number of correct positive results divided by the total number of predicted positive results (e.g., of the mushrooms we predicted were edible, how many actually were?).

Recall is the number of correct positive results divided by the number of positive results that should have been returned (e.g., of the mushrooms that were poisonous, how many did we correctly predict to be poisonous?).

The F1 score is a measure of a test's accuracy. It considers both the precision and the recall to compute the score. The F1 score can be interpreted as a weighted average of precision and recall, where the F1 score reaches its best value at 1 and its worst at 0.

precision = true positives / (true positives + false positives)

recall = true positives / (false negatives + true positives)

F1 score = 2 * (precision * recall) / (precision + recall)
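As a tiny worked example (toy labels, purely illustrative), here are the three metrics computed with scikit-learn:

# Toy example: 3 true positives, 1 false positive, 1 false negative
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(f1_score(y_true, y_pred))         # 2*(0.75*0.75)/(0.75+0.75) = 0.75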
Now we are ready to make some predictions!

Let's build a method for evaluating multiple estimators, first using traditional numeric scores (we'll later compare them against visual diagnostics from the Yellowbrick library).

from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
def model_selection(X, y, estimator):
    """
    Test various estimators.
    """
    y = LabelEncoder().fit_transform(y.values.ravel())
    model = Pipeline([
         ('label_encoding', EncodeCategorical(X.keys())),
         ('one_hot_encoder', OneHotEncoder(categories='auto')),  # categories='auto' avoids a FutureWarning in newer sklearn
         ('estimator', estimator)
    ])

    # Instantiate the classification model and visualizer
    model.fit(X, y)

    expected  = y
    predicted = model.predict(X)

    # Compute and return the F1 score (the harmonic mean of precision and recall)
    return f1_score(expected, predicted)
from sklearn.svm import LinearSVC, NuSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier
model_selection(X, y, LinearSVC())
0.6582119537920643



import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn")  # Ignore warnings
model_selection(X, y, NuSVC())
0.6878837238441299



model_selection(X, y, SVC())
0.6625145971195017



model_selection(X, y, SGDClassifier())
0.5738408700629649



model_selection(X, y, KNeighborsClassifier())
0.6856846473029046



model_selection(X, y, LogisticRegressionCV())
0.6582119537920643



model_selection(X, y, LogisticRegression())
0.6578749058025622



model_selection(X, y, BaggingClassifier())
0.6873901878632248



model_selection(X, y, ExtraTreesClassifier())
0.6872294372294372



model_selection(X, y, RandomForestClassifier())
0.6992081007399714
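To scan these scores side by side rather than cell by cell, one can collect them in a loop (a convenience sketch, not part of the original tutorial):

# Score every estimator and print the results sorted by F1
models = [LinearSVC(), NuSVC(), SVC(), SGDClassifier(),
          KNeighborsClassifier(), LogisticRegressionCV(),
          LogisticRegression(), BaggingClassifier(),
          ExtraTreesClassifier(), RandomForestClassifier()]
scores = {m.__class__.__name__: model_selection(X, y, m) for m in models}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:>24}: {score:.4f}")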


Preliminary Model Assessment

Which model performs best according to the results of the above F1 scores?

Visual Model Assessment

Now let's refactor our model evaluation function to use Yellowbrick's ClassificationReport class, a model visualizer that displays precision, recall, and F1 scores. This visual model analysis tool integrates numerical scores with a color-coded heatmap to support easy interpretation and detection, particularly of the nuances of Type I and Type II errors, which are highly relevant to our use case.

A Type I error (or a "false positive") is detecting an effect that is not present (e.g., determining a mushroom is poisonous when it is in fact edible).

A Type II error (or a "false negative") is failing to detect an effect that is present (e.g., believing a mushroom is edible when it is in fact poisonous).
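In confusion-matrix terms (a toy sketch; the coding 1 = poisonous is an assumption for illustration):

# Toy confusion matrix: fp counts Type I errors, fn counts Type II errors
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1]   # 0 = edible, 1 = poisonous (assumed coding)
y_pred = [0, 1, 0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fp)  # Type I:  edible mushroom flagged as poisonous
print(fn)  # Type II: poisonous mushroom passed as edible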

from sklearn.pipeline import Pipeline
from yellowbrick.classifier import ClassificationReport


def visual_model_selection(X, y, estimator):
    """
    Test various estimators.
    """
    y = LabelEncoder().fit_transform(y.values.ravel())
    model = Pipeline([
         ('label_encoding', EncodeCategorical(X.keys())),
         ('one_hot_encoder', OneHotEncoder(categories='auto')),  # as above, avoids the FutureWarning
         ('estimator', estimator)
    ])

    # Instantiate the classification model and visualizer
    visualizer = ClassificationReport(model, classes=['edible', 'poisonous'])
    visualizer.fit(X, y)
    visualizer.score(X, y)
    visualizer.poof()
visual_model_selection(X, y, LinearSVC())

# Visualization of other classifiers
visual_model_selection(X, y, RandomForestClassifier())
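The same call works for the remaining estimators; a loop such as the following (a convenience sketch) renders a report for each:

# Render a classification report heatmap for each remaining estimator
for clf in (NuSVC(), SVC(), SGDClassifier(), KNeighborsClassifier(),
            LogisticRegressionCV(), LogisticRegression(),
            BaggingClassifier(), ExtraTreesClassifier()):
    visual_model_selection(X, y, clf)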

Exercise

Now, which model looks best? Why?
Which model is most likely to save your life?
What is the difference between visual model evaluation and numerical model evaluation?

For more on precision, recall, and the combined F1-Measure metric, see:
http://www.makaidong.com/%E5%...
The F1 score balances precision and recall in a single number.
Visualization is intuition. I'm off~

About the Author

yeayee on Zhihu; five years with Python; proficient in Flask + MongoDB + sklearn + Bokeh
