AutoGluon tutorial 1 - a simple starter model

Posted by alapimba on Mon, 24 Jan 2022 03:03:24 +0100

Preface

I'm honestly lazy, and I don't have a solid grounding in machine learning or deep learning, but I still want to build some fun models under the banner of artificial intelligence. I heard that there is a module that can build a deep learning model with very little effort, and that the results are reasonably good even without much parameter tuning. Since there are hardly any tutorials online, I read the docs and wrote this up myself.
Reference link:
GitHub source address: https://github.com/awslabs/autogluon
Official website tutorial: https://auto.gluon.ai/stable/index.html

Install

At present the module only supports Linux and macOS; Windows support is still under development. If you want to try it on Windows, you can install WSL and then run the following commands to install the module:

python3 -m pip install -U pip
python3 -m pip install -U setuptools wheel
python3 -m pip install autogluon
# cpu version
python3 -m pip install -U "mxnet<2.0.0"
# gpu version
# Here we assume CUDA 10.1 is installed.  You should change the number
# according to your own CUDA version (e.g. mxnet_cu100 for CUDA 10.0).
python3 -m pip install -U "mxnet_cu101<2.0.0"

Library function

The tasks this library covers are split into several tutorials: tabular prediction, image prediction, object detection, text prediction, and so on. This article works through the first one, tabular prediction.

Tabular Prediction

Definition: as I understand it, tabular prediction means feeding in a data table and then performing the relevant machine learning task based on that information.
Advantages: no manual data cleaning, feature engineering, hyperparameter optimization, or model selection is required.

Example 1

Objective: predict whether a person's income exceeds $50,000

Import data, build objects

# import data 
import pandas as pd
import numpy as np
from autogluon.tabular import TabularDataset, TabularPredictor
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
subsample_size = 500  # subsample subset of data for faster demo, try setting this to much larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head()


The AutoGluon dataset object constructed here, TabularDataset, is essentially a pandas DataFrame, so any pandas DataFrame method works on it, such as the train_data.head() call above. Similarly, if you have your own data, you can construct the object the same way, for example from a local CSV file as in the sketch below.
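If your data lives in a local file rather than at a URL, the same constructor works. A minimal sketch, assuming a hypothetical local file my_data.csv:

from autogluon.tabular import TabularDataset

# TabularDataset accepts a local file path (or a URL); the resulting object
# behaves like a pandas DataFrame, so the usual DataFrame methods apply.
my_data = TabularDataset('my_data.csv')  # 'my_data.csv' is a placeholder path
my_data.head()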

In any case, check the help documentation whenever anything is in doubt.

label = 'class'
pd.value_counts(train_data[label])  # or train_data[label].value_counts()
# Either form works; it's a matter of preference
# Inspect the label values and their counts

Training model

save_path = 'agModels-predictClass'  # specifies folder to store trained models
predictor = TabularPredictor(label=label, path=save_path).fit(train_data)

save_path specifies the folder where the trained models are stored, and TabularPredictor(...).fit(...) builds the model directly from the training data.

The log output from fit() is quite interesting: it explains that AutoGluon inferred this to be a binary classification task, how to override that inference if it is wrong (see the sketch below), and how much memory is currently available.
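If AutoGluon's guess about the task type is ever wrong, the predictor constructor lets you state it explicitly via problem_type. A minimal sketch, reusing label and save_path from above:

# Tell AutoGluon this is binary classification instead of letting it
# infer the problem type from the label column.
predictor = TabularPredictor(
    label=label,
    path=save_path,
    problem_type='binary',  # other documented values include 'multiclass' and 'regression'
).fit(train_data)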

Then... ha, my notebook is pretty underpowered, but AutoGluon automatically adjusts its settings according to the available memory.

Load test set and verify

# load test set 
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
y_test = test_data[label]  # values to predict
test_data_nolab = test_data.drop(columns=[label])  # delete label column to prove we're not cheating
test_data_nolab.head()
predictor = TabularPredictor.load(save_path)  # unnecessary, just demonstrates how to load previously-trained predictor from file

# model evaluation
y_pred = predictor.predict(test_data_nolab)
print("Predictions:  \n", y_pred)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred, auxiliary_metrics=True)


First we show an overview of the test set, then the results: all the basic model evaluation metrics. What I didn't expect is that training with the same data and the same code gives slightly different results from those on the official website, probably because of the different runtime environment; the gap is small, about 4 percentage points.
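As a shortcut, when the test set still contains the label column, the predictor can be scored on it directly. A small sketch:

# evaluate() predicts internally and compares against the label column,
# so there is no need to split off y_test by hand.
perf = predictor.evaluate(test_data)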

Show the performance of all the trained models on the test set

predictor.leaderboard(test_data, silent=True)


Judging from the classification accuracy, predictor.predict(test_data_nolab) is using the WeightedEnsemble_L2 model here.
So far the first simple model has been trained, but I still want to see the specific parameters of this model; that needs further exploration, and I will release follow-up tutorials. This is the first tutorial, so if anything is unreasonable, please point it out.
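For a first peek at those parameters before the follow-up tutorials, the leaderboard can be asked for extra details. A hedged sketch, assuming the extra_info flag behaves as in the version I used (column names may differ between releases):

# extra_info=True adds columns such as each model's hyperparameters,
# which partly answers "what are the specific parameters of this model".
detailed = predictor.leaderboard(test_data, extra_info=True, silent=True)
print(detailed.columns.tolist())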

Show the accuracy of a particular classifier

predictor.predict(test_data, model='LightGBM')
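That call only returns the predictions of that single model; to see its accuracy, score those predictions against the true labels with the same helper used earlier. A minimal sketch reusing y_test from above:

# Predict with only the LightGBM model, then evaluate those predictions.
y_pred_lgbm = predictor.predict(test_data_nolab, model='LightGBM')
perf_lgbm = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred_lgbm,
                                           auxiliary_metrics=True)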

Additional parts

Output prediction probabilities

pred_probs = predictor.predict_proba(test_data_nolab)
pred_probs.head(5)

What happens during fitting

results = predictor.fit_summary()
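results here is a plain dictionary summarizing training. A hedged sketch of poking at it; the key names are the ones I saw in the version I used and may differ elsewhere, hence the use of .get():

# Using .get() so the sketch does not fail if a key is named differently
# in your AutoGluon version.
print(results.get('model_best'))         # assumed key: name of the best model
print(results.get('model_performance'))  # assumed key: validation score per model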


Getting higher accuracy

  • time_limit: the maximum time (in seconds) to spend training; usually left unset
  • eval_metric: the evaluation metric, e.g. ROC AUC or accuracy
  • presets: defaults to 'medium_quality_faster_train', which trades some accuracy for speed; setting it to 'best_quality' enables bagging and stacking to improve performance, as in the code below

time_limit = 60  # for quick demonstration only, you should set this to longest time you are willing to wait (in seconds)
metric = 'roc_auc'  # specify your evaluation metric here
predictor = TabularPredictor(label, eval_metric=metric).fit(train_data, time_limit=time_limit, presets='best_quality')
predictor.leaderboard(test_data, silent=True)

Topics: AI