Introduction to deep learning series 11: reducing overfitting with Dropout regularization

Posted by wild_dog on Tue, 14 Dec 2021 22:25:22 +0100

Hi, I'm Howzit. This is the eleventh part of the Introduction to Deep Learning series. Welcome, and let's learn together!

Introduction to deep learning series 1: overview of multi-layer perceptrons
Introduction to deep learning series 2: build your first neural network with TensorFlow
Introduction to deep learning series 3: performance evaluation methods for deep learning models
Introduction to deep learning series 4: finding the best model with scikit-learn
Introduction to deep learning series 5 project practice: identifying Iris species with deep learning
Introduction to deep learning series 6 project practice: Sonar echo recognition
Introduction to deep learning series 7 project practice: Boston housing price regression
Introduction to deep learning series 8: saving models with serialization for continued training
Introduction to deep learning series 9: saving the best model during training with checkpoints
Introduction to deep learning series 10: understanding model behavior during training by plotting training records
Introduction to deep learning series 11: reducing overfitting with Dropout regularization
To be updated
Introduction to deep learning series 12: improving performance with learning rate schedules
Introduction to deep learning series 13: convolutional neural networks
Introduction to deep learning series 14 project practice: handwritten digit recognition
Introduction to deep learning series 15: improving model performance with image augmentation
Introduction to deep learning series 16: object recognition in images
Introduction to deep learning series 17 project practice: predicting sentiment from film reviews
Introduction to deep learning series 18: recurrent neural networks

For neural networks and deep learning models, a simple and powerful regularization technique is Dropout. In this lesson you will learn about the Dropout regularization technique and how to apply it to your models in Python with Keras. After completing this lesson, you will know:
  • How does Dropout regularization work?
  • How do I use Dropout in the input layer?
  • How do I use Dropout in hidden layers?

11.1 Dropout regularization for neural networks

Dropout is a regularization technique for neural networks proposed by Srivastava et al. in their 2014 paper, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". Dropout is a technique in which randomly selected neurons are ignored during training. They are dropped out at random, which means that their contribution to the activation of downstream neurons is removed on the forward pass, and their weights are not updated on the backward pass.

As a neural network model learns, the weights of its neurons settle into their context within the whole network. The weights of a neuron are tuned for specific features, providing some specialization, and neighboring neurons come to rely on this specialization. If this specialization goes too far, the model becomes fragile and overfits the training data. This reliance on context during training is referred to as complex co-adaptation. You can imagine that if neurons are randomly dropped out during training, other neurons have to step in and handle the representation needed to make predictions for the missing neurons. This is believed to result in the network learning multiple independent internal representations.

As a result, the network becomes less sensitive to the specific weights of individual neurons. This, in turn, results in a network with better generalization that is less likely to overfit the training data.
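
The idea can be sketched in a few lines of NumPy. The snippet below is illustrative only (Keras does this for you inside its Dropout layer) and uses the "inverted dropout" formulation, where surviving activations are scaled up during training so that nothing needs to change at evaluation time:

# A minimal sketch of how dropout works on one layer's activations.
# Illustrative only; in practice the Keras Dropout layer handles this.
import numpy as np

def dropout_forward(activations, rate=0.2, training=True):
    """Apply inverted dropout: zero out `rate` of the units during training
    and scale the survivors so the expected activation stays the same."""
    if not training:
        # Nothing is dropped at evaluation time.
        return activations
    keep_prob = 1.0 - rate
    mask = np.random.binomial(1, keep_prob, size=activations.shape)
    return activations * mask / keep_prob

a = np.ones(10)
print(dropout_forward(a, rate=0.2, training=True))   # some units zeroed, the rest scaled to 1.25
print(dropout_forward(a, rate=0.2, training=False))  # unchanged at evaluation time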

11.2 Using Dropout regularization in Keras

Dropout is easily implemented by randomly selecting nodes to be dropped out with a given probability (e.g. 20%) in each weight update cycle. This is how Dropout is implemented in Keras. Dropout is only used during the training of a model; it is not used when evaluating the model. Next, we will explore a few different ways of using Dropout in Keras.
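
You can see this training-only behavior directly. The short check below (assuming TensorFlow 2.x) calls a Dropout layer with the training flag set explicitly; with training=False the layer is simply an identity mapping:

# Quick check that Keras Dropout is only active during training.
import tensorflow as tf

layer = tf.keras.layers.Dropout(0.2)
x = tf.ones((1, 10))

print(layer(x, training=True))   # roughly 20% of values zeroed, the rest scaled by 1/0.8
print(layer(x, training=False))  # identical to the input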

The examples use the Sonar binary classification dataset. We will evaluate the developed models with 10-fold cross-validation in scikit-learn in order to better tease out differences in the results. There are 60 input values and a single output value, and the input values are standardized before being used in the network. The baseline neural network model has two hidden layers, the first with 60 neurons and the second with 30. The model is trained with stochastic gradient descent using a relatively low learning rate and momentum. The complete baseline model is listed below:

# Baseline Model on the Sonar Dataset

import numpy
from pandas import read_csv
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.constraints import max_norm
from tensorflow.keras.optimizers import SGD
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline 

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load dataset
dataframe = read_csv("sonar.csv", header=None)
dataset = dataframe.values

# split into input (X) and output (Y) variables
X = dataset[:, 0:60].astype(float)
Y = dataset[:, 60]

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

# baseline
def create_baseline():
	model = Sequential()
	model.add(Dense(60, input_dim=60, activation="relu"))
	model.add(Dense(30, activation="relu"))
	model.add(Dense(1, activation="sigmoid"))
	# Compile model
	sgd = SGD(learning_rate=0.01, momentum=0.8)
	model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])
	return model

numpy.random.seed(seed)

estimators = []
estimators.append(("standardize", StandardScaler()))
estimators.append(("mlp", KerasClassifier(build_fn=create_baseline, nb_epoch=300, batch_size=16, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

Running this baseline model, without Dropout, gives an estimated classification accuracy of about 86%:

Baseline: 86.04% (4.58%)

11.3 Using Dropout on the input layer

Dropout can be applied to the input neurons, also known as the visible layer. In the example below, we add a Dropout layer between the input (visible) layer and the first hidden layer. The Dropout rate is set to 20%, which means one in five inputs will be randomly excluded from each update cycle.

In addition, as recommended in the original paper on Dropout, a constraint is imposed on the weights of each hidden layer, ensuring that the maximum norm of the weights does not exceed 3. This is done by setting the kernel_constraint argument on the Dense class when constructing the layers. The learning rate is also lifted by one order of magnitude and the momentum is increased to 0.9, as the original paper recommends. Continuing from the baseline example above, the code below exercises the same network with Dropout on the input layer:

# Dropout on the input (visible) layer for the Sonar dataset
import numpy
from pandas import read_csv
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.constraints import max_norm
from tensorflow.keras.optimizers import SGD
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline 

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load dataset

dataframe = read_csv("sonar.csv", header=None)
dataset = dataframe.values

# split into input (X) and output (Y) variables
X = dataset[:, 0:60].astype(float)
Y = dataset[:, 60]

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

# model with dropout on the input layer
def create_baseline():
	model = Sequential()
	model.add(Dropout(0.2, input_shape=(60,)))
	model.add(Dense(60, activation="relu", kernel_constraint=max_norm(3)))
	model.add(Dense(30, activation="relu", kernel_constraint=max_norm(3)))
	model.add(Dense(1, activation="sigmoid"))
	
	# Compile model
	sgd = SGD(learning_rate=0.1, momentum=0.9)
	model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])
	return model

numpy.random.seed(seed)
estimators = []
estimators.append(("standardize", StandardScaler()))
estimators.append(("mlp", KerasClassifier(build_fn=create_baseline, nb_epoch=300, batch_size=16, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)

print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

Running the example with a Dropout layer on the visible layer gives a classification accuracy of about 84% in this run:

Visible: 83.52% (7.68%)

11.4 Using Dropout on hidden layers

Dropout can also be applied to the hidden neurons in your network model. In the example below, Dropout is applied between the two hidden layers and between the last hidden layer and the output layer. Again a Dropout rate of 20% is used, together with a max-norm weight constraint on these layers.

# Dropout on the hidden layers for the Sonar dataset

import numpy

from pandas import read_csv
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.constraints import max_norm
from tensorflow.keras.optimizers import SGD
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline 

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# load dataset
dataframe = read_csv("sonar.csv", header=None)
dataset = dataframe.values

# split into input (X) and output (Y) variables
X = dataset[:, 0:60].astype(float)
Y = dataset[:, 60]

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

# model with dropout on the hidden layers
def create_baseline():

	model = Sequential()
	model.add(Dense(60, input_dim=60, activation="relu", W_constraint=max_norm(3)))
	model.add(Dropout(0.2))
	model.add(Dense(30, activation="relu", W_constraint=max_norm(3)))
	model.add(Dropout(0.2))
	model.add(Dense(1, activation="sigmoid"))

	# Compile model
	sgd = SGD(learning_rate=0.1, momentum=0.9)
	model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])
	return model

numpy.random.seed(seed)

estimators = []
estimators.append(("standardize", StandardScaler()))
estimators.append(("mlp", KerasClassifier(build_fn=create_baseline, nb_epoch=300, batch_size=16, verbose=0)))
pipeline = Pipeline(estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X, encoded_Y, cv=kfold)
print("Baseline: %.2f%% (%.2f%%)" % (results.mean() * 100, results.std() * 100))

We can see that, for this problem and the chosen network configuration, using Dropout on the hidden layers did not lift performance. In fact, performance was slightly worse than the baseline. It is possible that additional training epochs are required or that the learning rate needs further tuning.

Hidden: 83.59% (7.31%)

11.5 Tips for using Dropout

The original paper on Dropout provides experimental results on a suite of standard machine learning problems. As a result, it offers a number of useful heuristics to consider when using Dropout in practice (a sketch combining these suggestions follows the list):

  • Generally, use a small Dropout value of 20%-50% of neurons, with 20% providing a good starting point. A probability that is too low has minimal effect, while a value that is too high results in under-learning by the network.
  • Use a larger network. You are likely to get better performance with a larger network, which gives the model more of an opportunity to learn independent representations.
  • Use Dropout on the input (visible) layer as well as the hidden layers. Applying Dropout at each layer of the network has shown good results.
  • Use a large learning rate with decay, and a large momentum. Increase your learning rate by a factor of 10 to 100 and use a high momentum value of 0.9 or 0.99.
  • Constrain the size of the network weights. A large learning rate can result in very large network weights. Imposing a constraint on the size of the weights, such as max-norm regularization with a size of 4 or 5, has been shown to improve results.
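
As a rough illustration, the sketch below combines these suggestions for the Sonar network: Dropout on the input and hidden layers, max-norm weight constraints, a somewhat larger network, and SGD with a larger learning rate and high momentum. The layer sizes and hyperparameter values here are illustrative starting points to tune, not settings validated on this dataset.

# Sketch combining the Dropout tips above (values are starting points only)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.constraints import max_norm
from tensorflow.keras.optimizers import SGD

def create_tuned_model():
	model = Sequential()
	model.add(Dropout(0.2, input_shape=(60,)))                               # dropout on the visible layer
	model.add(Dense(80, activation="relu", kernel_constraint=max_norm(3)))   # larger first hidden layer
	model.add(Dropout(0.2))                                                  # dropout between hidden layers
	model.add(Dense(40, activation="relu", kernel_constraint=max_norm(3)))
	model.add(Dropout(0.2))
	model.add(Dense(1, activation="sigmoid"))
	sgd = SGD(learning_rate=0.1, momentum=0.9)                               # 10x the baseline learning rate, high momentum
	model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])
	return model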

11.6 Summary

In this lesson, you discovered the Dropout regularization technique for deep learning models. You learned:

  • What Dropout is and how it works.
  • How to use Dropout on your own deep learning models.
  • Tips for getting the best results from Dropout on your own models.

11.6.1 Next

Another important technique for improving the performance of your model is to adjust the learning rate during training. In the next lesson, you will learn about different learning rate schedules and how you can use them on your own problems with Keras.

Topics: neural networks TensorFlow Deep Learning