Overfitting and Underfitting
Overfitting: the model's performance on held-out validation data always peaks after a few epochs and then begins to degrade.
Underfitting: the lower the loss on the training data, the lower the loss on the test data as well; the model has not yet captured all the relevant patterns in the training data.
Concepts of optimization and generalization
Optimization: adjusting the model to get the best possible performance on the training data.
Generalization: the performance of the trained model on data it has never seen before. The goal of machine learning is, of course, good generalization.
How to Solve the Problem of Overfitting
- The best solution is to get more training data.
- The next-best solution is regularization: constraining the amount of information the model is allowed to store.
The main regularization methods are:
- Reducing the network size
- Adding weight regularization
- Adding dropout regularization
Reduce network size
The simplest way to prevent overfitting is to reduce the size of the model, that is, to reduce the number of learnable parameters (which is determined by the number of layers and the number of units per layer). In effect, this method reduces the model's capacity.
Why is this possible?
For example, a model with 500,000 binary parameters could easily learn the class of every digit in the MNIST training set: we would only need 10 binary parameters for each of the 50,000 digits. But such a model would be useless for classifying new digit samples; it would not generalize to new data.
Here is what you need to understand: deep learning models tend to be good at fitting the training data, but the real challenge is generalization, not fitting.
There is no standard formula for determining the right network size, i.e. the optimal number of layers or the optimal size of each layer. In practice you evaluate a range of architectures, usually going from small to large, and then settle on one size.
In the example below, you can see intuitively that the smaller network starts overfitting later than the larger one, and that the larger network overfits more severely, with larger fluctuations in its validation loss.
```python
from keras.datasets import imdb
import numpy as np

(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)

# Encode the integer sequences into a binary matrix
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

# Turn the data into matrices
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)

# Vectorize the labels
y_train = np.asarray(train_labels).astype('float32')
y_test = np.asarray(test_labels).astype('float32')
```
```python
from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

# Standard optimizer; monitor accuracy during training
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
```
```python
# Set aside a validation set
x_val = x_train[:10000]
partial_x_train = x_train[10000:]
y_val = y_train[:10000]
partial_y_train = y_train[10000:]
```
```python
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))
history_dict = history.history
```
```python
# A smaller network with fewer units per layer
model2 = models.Sequential()
model2.add(layers.Dense(4, activation='relu', input_shape=(10000,)))
model2.add(layers.Dense(4, activation='relu'))
model2.add(layers.Dense(1, activation='sigmoid'))

# Standard optimizer; monitor accuracy during training
model2.compile(optimizer='rmsprop',
               loss='binary_crossentropy',
               metrics=['accuracy'])
```
```python
history2 = model2.fit(partial_x_train,
                      partial_y_train,
                      epochs=20,
                      batch_size=512,
                      validation_data=(x_val, y_val))
history_dict2 = history2.history
```
```python
# Plot the validation loss of the original network against the smaller one
import matplotlib.pyplot as plt

loss_values_normal = history_dict['val_loss']
loss_values_less = history_dict2['val_loss']
epochs = range(1, len(loss_values_normal) + 1)

plt.plot(epochs, loss_values_normal, 'b+', label='normal loss')
plt.plot(epochs, loss_values_less, 'bo', label='less loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
```
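The comparison above covers only two sizes. Since there is no standard formula, one typically evaluates a range of sizes; below is a minimal sketch of such a sweep, reusing the data and validation split prepared above. The candidate sizes and the use of the minimum validation loss as the comparison metric are illustrative choices, not from the original text.

```python
# Sketch: train the same architecture with several candidate layer sizes and
# compare their best validation losses (assumes partial_x_train, partial_y_train,
# x_val and y_val from the code above).
from keras import models
from keras import layers

for units in [4, 16, 64]:
    candidate = models.Sequential()
    candidate.add(layers.Dense(units, activation='relu', input_shape=(10000,)))
    candidate.add(layers.Dense(units, activation='relu'))
    candidate.add(layers.Dense(1, activation='sigmoid'))
    candidate.compile(optimizer='rmsprop',
                      loss='binary_crossentropy',
                      metrics=['accuracy'])
    h = candidate.fit(partial_x_train, partial_y_train,
                      epochs=20, batch_size=512,
                      validation_data=(x_val, y_val), verbose=0)
    print(units, 'best val_loss:', min(h.history['val_loss']))
```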
Adding Weight Regularization
Occam's razor principle
If there are two explanations for something, the one most likely to be correct is the simplest one, i.e. the one that makes fewer assumptions.
This also applies to neural network models: given some training data and a network architecture, many sets of weight values (i.e. many models) can explain the data. Simpler models are less likely to overfit than complex ones.
A simple model here means a model in which the distribution of parameter values has lower entropy (or a model with fewer parameters). Following this principle, a common way to reduce overfitting is to force the model's weights to take only small values. This method is called weight regularization.
Implementing weight regularization
It works by adding to the network's loss function a cost associated with having large weights:
- L1 regularization: the added cost is proportional to the absolute value of the weight coefficients (the L1 norm of the weights).
- L2 regularization: the added cost is proportional to the square of the value of the weight coefficients (the L2 norm of the weights). L2 regularization of neural networks is also called weight decay. (A small numeric sketch of both penalties follows this list.)
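To make the two penalties concrete, here is a minimal sketch of how they are computed; the weight matrix and the factor 0.001 are made up for illustration and are not part of the original text.

```python
import numpy as np

W = np.array([[0.5, -1.2],
              [0.3,  0.8]])                 # hypothetical weight matrix of one layer
factor = 0.001                              # illustrative regularization factor

l1_penalty = factor * np.sum(np.abs(W))     # L1: proportional to |w|
l2_penalty = factor * np.sum(np.square(W))  # L2: proportional to w**2

# During training such a penalty is added to the network's loss;
# it is not applied at test time.
```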
In Keras, you add weight regularization by passing a weight regularizer instance to a layer as a keyword argument. The code is as follows:
```python
## Weight regularization
from keras import regularizers

# l2(0.001) means every coefficient in the layer's weight matrix adds
# 0.001 * weight_coefficient_value ** 2 to the total loss of the network.
model3 = models.Sequential()
model3.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                        activation='relu', input_shape=(10000,)))
model3.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                        activation='relu'))
model3.add(layers.Dense(1, activation='sigmoid'))

# Standard optimizer; monitor accuracy during training
model3.compile(optimizer='rmsprop',
               loss='binary_crossentropy',
               metrics=['accuracy'])
```
```python
history3 = model3.fit(partial_x_train,
                      partial_y_train,
                      epochs=20,
                      batch_size=512,
                      validation_data=(x_val, y_val))
history_dict3 = history3.history
```
```python
# Plot the validation loss of the original network against the L2-regularized one
import matplotlib.pyplot as plt

loss_values_normal = history_dict['val_loss']
loss_values_regularized = history_dict3['val_loss']
epochs = range(1, len(loss_values_normal) + 1)

plt.plot(epochs, loss_values_normal, 'b+', label='normal loss')
plt.plot(epochs, loss_values_regularized, 'bo', label='regularized loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
```
```python
# Other weight regularizers are also available
from keras import regularizers

# L1 regularization
regularizers.l1(0.001)

# Simultaneous L1 and L2 regularization
regularizers.l1_l2(l1=0.001, l2=0.001)
```
Adding dropout regularization
Dropout is one of the most effective and most commonly used regularization techniques for neural networks. Applying dropout to a layer means randomly dropping out (setting to zero) a fraction of that layer's output features during training. The dropout rate is the fraction of features that are set to zero; it is usually in the range 0.2-0.5. Dropout is not applied at test time.
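As a minimal sketch of what this means for a single layer's output (the activation matrix and the 0.5 rate here are made up for illustration, not taken from the original code):

```python
import numpy as np

layer_output = np.random.rand(4, 8)  # hypothetical activations of one layer

# Training time: randomly zero out features (here roughly half, i.e. a rate of 0.5)
mask = np.random.randint(0, 2, size=layer_output.shape)
layer_output_train = layer_output * mask

# Test time: nothing is dropped; frameworks compensate for the scale difference
# automatically (Keras handles this inside layers.Dropout).
layer_output_test = layer_output
```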
The core idea of this technique is to introduce noise into a layer's output values in order to break up happenstance patterns that are not significant (Hinton calls them conspiracies); without the noise, the network would start memorizing these accidental patterns. The code is as follows:
```python
## Dropout regularization (combined here with the L2 weight regularization from above)
model4 = models.Sequential()
model4.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                        activation='relu', input_shape=(10000,)))
model4.add(layers.Dropout(0.5))
model4.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                        activation='relu'))
model4.add(layers.Dropout(0.5))
model4.add(layers.Dense(1, activation='sigmoid'))

# Standard optimizer; monitor accuracy during training
model4.compile(optimizer='rmsprop',
               loss='binary_crossentropy',
               metrics=['accuracy'])
```
```python
history4 = model4.fit(partial_x_train,
                      partial_y_train,
                      epochs=20,
                      batch_size=512,
                      validation_data=(x_val, y_val))
history_dict4 = history4.history
```
```python
# Plot the validation loss of the original network against the dropout one
import matplotlib.pyplot as plt

loss_values_normal = history_dict['val_loss']
loss_values_dropout = history_dict4['val_loss']
epochs = range(1, len(loss_values_normal) + 1)

plt.plot(epochs, loss_values_normal, 'b+', label='normal loss')
plt.plot(epochs, loss_values_dropout, 'bo', label='dropout loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
```