Background introduction
When learning sklearn, beyond the difficulty of the algorithms themselves, we also have to learn matplotlib for visualization. For my practical work, visualization matters even more. However, matplotlib's ease of use and aesthetics leave much to be desired. I tried plotly and seaborn in turn and finally settled on Bokeh, because it integrates seamlessly with Flask, which makes building data dashboards far less painful.
Recently I came across Yellowbrick, a library that makes data exploration more convenient, so I set aside some time to learn it. I started with the English documentation, then discovered that someone had been translating it into Chinese. The translation reads like Google Translate, but in the spirit of borrowing what works and saving effort, I copied it and worked through about half, though I still found some inconsistencies with the official documents.
# http://www.scikit-yb.org/zh/latest/tutorial.html
Model Selection Tutorial
In this tutorial, we will look at the scores of various Scikit-Learn models and compare them with Yellowbrick's visual diagnostic tools in order to select the best model for our data.
Model Selection Triple
Discussions of machine learning often focus on model selection. Whether it is logistic regression, random forests, Bayesian methods, or artificial neural networks, machine learning practitioners are usually quick to express their preferences. This is mostly for historical reasons. Although modern third-party machine learning libraries have made deploying any of these models nearly trivial, traditionally the application and tuning of even one of these algorithms took years of study. As a result, machine learning practitioners tend to have strong preferences for particular (and more likely familiar) models over others.
However, model selection is more nuanced than simply picking the "right" or "wrong" algorithm. In practice, the workflow includes:

1. selecting and/or engineering the smallest and most predictive feature set,
2. choosing a set of algorithms from a model family, and
3. tuning the algorithm's performance through hyperparameter optimization.
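To make the triple concrete, here is a minimal sketch of all three steps expressed as one scikit-learn Pipeline wrapped in a grid search. This is my own illustration, not part of the tutorial: the toy data, step names, and parameter grid are all assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data; chi2 scoring requires non-negative features
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
X_demo = np.abs(X_demo)

pipe = Pipeline([
    ('select', SelectKBest(chi2)),                    # step 1: feature set
    ('clf', RandomForestClassifier(random_state=0)),  # step 2: algorithm
])

# Step 3: search the hyperparameter space (an illustrative grid)
search = GridSearchCV(pipe, {
    'select__k': [3, 5, 10],
    'clf__n_estimators': [10, 100],
}, cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)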
The model selection triple was first described in a 2015 SIGMOD paper by Kumar et al. In their paper, which concerns the development of next-generation database systems for predictive modeling, the authors aptly observe that machine learning in practice is highly experimental, so such systems are urgently needed. "Model selection," they explain, "is iterative and exploratory because the space of (model selection triples) is usually infinite, and it is generally impossible for analysts to know a priori which (combination) will yield satisfactory accuracy and/or insight."
Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance focus in on model quality far more effectively than exhaustive search. By visualizing the model selection process, data scientists can steer toward the final, interpretable model and avoid pitfalls.
The Yellowbrick library is a visual diagnostic platform for machine learning that lets data scientists steer the model selection process. Yellowbrick extends the Scikit-Learn API with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of a Scikit-Learn Pipeline, providing visual diagnostics throughout the transformation of high-dimensional data.
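As a quick taste of that API, here is a minimal sketch of a Visualizer in action, assuming yellowbrick is installed. Rank2D is one of Yellowbrick's feature-analysis visualizers; since it expects numeric features, the random matrix below is a stand-in, not our mushroom data.

import numpy as np
from yellowbrick.features import Rank2D

X_num = np.random.rand(100, 5)     # hypothetical numeric feature matrix
viz = Rank2D(algorithm='pearson')  # a Visualizer behaves like an estimator
viz.fit(X_num)                     # fit() follows the Scikit-Learn API...
viz.transform(X_num)               # ...so it can sit inside a Pipeline
viz.poof()                         # render the diagnostic plot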
About the Data
This tutorial uses a modified version of the mushroom dataset from the UCI Machine Learning Repository. Our objective is to predict whether a mushroom is poisonous or edible based on its characteristics.
The data consist of hypothetical sample descriptions corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota families. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended (this latter class was combined with the poisonous one).
Our file, "agaricus-lepiota.txt", contains information on 3 nominally valued attributes and a target value for 8124 mushroom instances (4208 edible, 3916 poisonous).
Let's load the data with Pandas.
import os
import pandas as pd

mushrooms = 'data/shrooms.csv'  # dataset file
dataset = pd.read_csv(mushrooms)
# dataset.columns = names
dataset.head()
 | id | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | ... | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | Unnamed: 24
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | p | x | s | n | t | p | f | c | n | ... | w | w | p | w | o | p | k | s | u | NaN
1 | 2 | e | x | s | y | t | a | f | c | b | ... | w | w | p | w | o | p | n | n | g | NaN
2 | 3 | e | b | s | w | t | l | f | c | b | ... | w | w | p | w | o | p | n | n | m | NaN
3 | 4 | p | x | y | w | t | p | f | c | n | ... | w | w | p | w | o | p | k | s | u | NaN
4 | 5 | e | x | s | g | f | n | f | w | b | ... | w | w | p | w | o | e | n | a | g | NaN

5 rows × 25 columns
features = ['cap-shape', 'cap-surface', 'cap-color']
target = ['class']

X = dataset[features]
y = dataset[target]
dataset.shape  # two fewer mushrooms than in the official docs
(8122, 25)
dataset.groupby('class').count()  # one fewer mushroom in each class
class | id | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | Unnamed: 24
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
e | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | ... | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | 0
p | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | ... | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | 0

2 rows × 24 columns
Feature Extraction
Our data, including the target, is categorical. To use machine learning, we need to convert these values to numeric form. To extract this from the dataset, we must use Scikit-Learn transformers to transform the input dataset into something a model can fit to. Fortunately, Scikit-Learn provides a transformer for converting categorical labels into integers: sklearn.preprocessing.LabelEncoder. Unfortunately, it can only transform a single vector at a time, so we will have to adapt it to apply to multiple columns.
Presumably each of these mushroom feature columns is exactly such a vector?
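To see that one-vector-at-a-time limitation in isolation, here is a tiny sketch of LabelEncoder on a single column of (illustrative) category codes:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
print(le.fit_transform(['x', 's', 'b', 'x']))  # [2 1 0 2] -- one column at a time
print(le.classes_)                             # ['b' 's' 'x']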
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


class EncodeCategorical(BaseEstimator, TransformerMixin):
    """
    Encodes a specified list of columns, or all columns if None.
    """

    def __init__(self, columns=None):
        # Guard against None so "encode all columns" actually works
        self.columns = [col for col in columns] if columns is not None else None
        self.encoders = None

    def fit(self, data, target=None):
        """
        Expects a data frame with named columns to encode.
        """
        # Encode all columns if columns is None
        if self.columns is None:
            self.columns = data.columns

        # Fit a label encoder for each column in the data frame
        self.encoders = {
            column: LabelEncoder().fit(data[column])
            for column in self.columns
        }
        return self

    def transform(self, data):
        """
        Uses the fitted encoders to transform a data frame.
        """
        output = data.copy()
        for column, encoder in self.encoders.items():
            output[column] = encoder.transform(data[column])
        return output
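A quick sanity check of the transformer (my own addition, reusing the features list and X defined above):

encoded = EncodeCategorical(features).fit_transform(X)
print(encoded.head())  # each category is now an integer code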
Modeling and Evaluation
Common indicators for evaluating classifiers
Precision is the number of correct positive results divided by the total number of returned positive results (e.g., of all the mushrooms we predicted to be edible, how many actually were?).
Recall is the number of correct positive results divided by the number of positive results that should have been returned (e.g., of all the mushrooms that are poisonous, how many did we correctly predict to be poisonous?).
The F1 score is a measure of a test's accuracy. It considers both the precision and the recall to compute the score. The F1 score can be interpreted as a weighted average of precision and recall, reaching its best value at 1 and its worst at 0.
precision = true positives / (true positives + false positives)
recall = true positives / (false negatives + true positives)
F1 score = 2 * ((precision * recall) / (precision + recall))
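As a worked check of these formulas, here is a small sketch with toy labels (1 = positive class); sklearn's metric functions agree with the hand computation:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1]

# By hand: TP = 3, FP = 1, FN = 1
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(f1_score(y_true, y_pred))         # 2 * (0.75 * 0.75) / (0.75 + 0.75) = 0.75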
Now we are ready to make some predictions!
Let's build a function for evaluating several estimators -- first using traditional numeric scores (which we will later compare against the visual diagnostics from the Yellowbrick library).
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline


def model_selection(X, y, estimator):
    """
    Test various estimators.
    """
    y = LabelEncoder().fit_transform(y.values.ravel())
    model = Pipeline([
        ('label_encoding', EncodeCategorical(X.keys())),
        ('one_hot_encoder', OneHotEncoder(categories='auto')),  # 'auto' silences a FutureWarning
        ('estimator', estimator)
    ])

    # Fit the classification model
    model.fit(X, y)

    expected = y
    predicted = model.predict(X)

    # Compute and return the F1 score (the harmonic mean of precision and recall)
    return f1_score(expected, predicted)
from sklearn.svm import LinearSVC, NuSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier
model_selection(X, y, LinearSVC())
0.6582119537920643
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn")  # ignore sklearn FutureWarnings
model_selection(X, y, NuSVC())
0.6878837238441299
model_selection(X, y, SVC())
0.6625145971195017
model_selection(X, y, SGDClassifier())
0.5738408700629649
model_selection(X, y, KNeighborsClassifier())
0.6856846473029046
model_selection(X, y, LogisticRegressionCV())
0.6582119537920643
model_selection(X, y, LogisticRegression())
0.6578749058025622
model_selection(X, y, BaggingClassifier())
0.6873901878632248
model_selection(X, y, ExtraTreesClassifier())
0.6872294372294372
model_selection(X, y, RandomForestClassifier())
0.6992081007399714
Preliminary Model Assessment
Which model performs best according to the results of the above F1 scores?
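To answer that at a glance, here is a small sketch (my own addition) that reuses model_selection() to rank the classifiers above by F1 score:

estimators = [
    LinearSVC(), NuSVC(), SVC(), SGDClassifier(), KNeighborsClassifier(),
    LogisticRegressionCV(), LogisticRegression(), BaggingClassifier(),
    ExtraTreesClassifier(), RandomForestClassifier(),
]
scores = sorted(
    ((type(est).__name__, model_selection(X, y, est)) for est in estimators),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in scores:
    print(f'{name:>24}  {score:.4f}')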
Visual Model Assessment
Now let's refactor our model evaluation function to use Yellowbrick's ClassificationReport class, a model visualizer that displays precision, recall, and F1 scores. This visual model analysis tool integrates numerical scores with a color-coded heatmap to support easy interpretation and detection, particularly of the nuances of Type I and Type II error, which are highly relevant (lifesaving, even) to our use case.
Type I error (or a "false positive") is detecting an effect that is not present (e.g., determining a mushroom is poisonous when it is in fact edible).
Type II error (or a "false negative") is failing to detect an effect that is present (e.g., believing a mushroom is edible when it is in fact poisonous), as the short sketch below illustrates.
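Here is that sketch with toy labels (1 = poisonous, 0 = edible); the counts can be checked by hand:

from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(fp)  # Type I errors (edible predicted poisonous): 1
print(fn)  # Type II errors (poisonous predicted edible): 1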
from sklearn.pipeline import Pipeline
from yellowbrick.classifier import ClassificationReport


def visual_model_selection(X, y, estimator):
    """
    Test various estimators.
    """
    y = LabelEncoder().fit_transform(y.values.ravel())
    model = Pipeline([
        ('label_encoding', EncodeCategorical(X.keys())),
        ('one_hot_encoder', OneHotEncoder()),
        ('estimator', estimator)
    ])

    # Instantiate the classification model and visualizer
    visualizer = ClassificationReport(model, classes=['edible', 'poisonous'])
    visualizer.fit(X, y)
    visualizer.score(X, y)
    visualizer.poof()
visual_model_selection(X, y, LinearSVC())
# Visualize another classifier for comparison
visual_model_selection(X, y, RandomForestClassifier())
Test
Now, which model looks best? Why? Which model is most likely to save your life? What is the difference between visual model evaluation and numerical model evaluation?
Precision, Recall, and the Combined Metric F1-Measure
http://www.makaidong.com/%E5%...
The F1 score takes both precision and recall into account.
Visualization is intuition at a glance. That's all for now~
About the Author
yeayee on Zhihu; five years with Python; works mainly with Flask + MongoDB + SKlearn + Bokeh.