TensorFlow 2.0 is a major slimming-down of version 1.x: eager execution is enabled by default, and Keras is the default high-level API. These improvements greatly reduce the difficulty of using TensorFlow.
This article describes a long, winding debugging experience with BatchNormalization in Keras + TensorFlow 2.0, one that almost ruined TF 2.0's new features for me. If you are working through the official TF 2.0 tutorials, it is worth a look.
How the problem arose
The transfer-learning tutorial [1] (https://www.tensorflow.org/alpha/tutorials/images/transfer_learning?hl=zh-cn) builds its model like this:
```python
IMG_SHAPE = (IMG_SIZE, IMG_SIZE, 3)

# Create the base model from the pre-trained model MobileNet V2
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights='imagenet')

# (The tutorial also sets base_model.trainable = False here, which is why the
# summary below reports the MobileNetV2 weights as non-trainable.)
model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES)
])
```
With this short snippet we reuse the MobileNetV2 architecture to build a classifier, and then we can call Keras's high-level interface to train the model:
```python
model.compile(optimizer=tf.keras.optimizers.RMSprop(lr=base_learning_rate),
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])
model.summary()

history = model.fit(train_batches.repeat(),
                    epochs=20,
                    steps_per_epoch=steps_per_epoch,
                    validation_data=validation_batches.repeat(),
                    validation_steps=validation_steps)
```
The output looks perfect:
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= mobilenetv2_1.00_160 (Model) (None, 5, 5, 1280) 2257984 _________________________________________________________________ global_average_pooling2d (Gl (None, 1280) 0 _________________________________________________________________ dense (Dense) (None, 2) 1281 ================================================================= Total params: 2,259,265 Trainable params: 1,281 Non-trainable params: 2,257,984 _________________________________________________________________ Epoch 11/20 581/581 [==============================] - 134s 231ms/step - loss: 0.4208 - accuracy: 0.9484 - val_loss: 0.1907 - val_accuracy: 0.9812 Epoch 12/20 581/581 [==============================] - 114s 197ms/step - loss: 0.3359 - accuracy: 0.9570 - val_loss: 0.1835 - val_accuracy: 0.9844 Epoch 13/20 581/581 [==============================] - 116s 200ms/step - loss: 0.2930 - accuracy: 0.9650 - val_loss: 0.1505 - val_accuracy: 0.9844 Epoch 14/20 581/581 [==============================] - 114s 196ms/step - loss: 0.2561 - accuracy: 0.9701 - val_loss: 0.1575 - val_accuracy: 0.9859 Epoch 15/20 581/581 [==============================] - 119s 206ms/step - loss: 0.2302 - accuracy: 0.9715 - val_loss: 0.1600 - val_accuracy: 0.9812 Epoch 16/20 581/581 [==============================] - 115s 197ms/step - loss: 0.2134 - accuracy: 0.9747 - val_loss: 0.1407 - val_accuracy: 0.9828 Epoch 17/20 581/581 [==============================] - 115s 197ms/step - loss: 0.1546 - accuracy: 0.9813 - val_loss: 0.0944 - val_accuracy: 0.9828 Epoch 18/20 581/581 [==============================] - 116s 200ms/step - loss: 0.1636 - accuracy: 0.9794 - val_loss: 0.0947 - val_accuracy: 0.9844 Epoch 19/20 581/581 [==============================] - 115s 198ms/step - loss: 0.1356 - accuracy: 0.9823 - val_loss: 0.1169 - val_accuracy: 0.9828 Epoch 20/20 581/581 [==============================] - 116s 199ms/step - loss: 0.1243 - accuracy: 0.9849 - val_loss: 0.1121 - val_accuracy: 0.9875
However, this style is not convenient for debugging. We wanted finer-grained control over the iteration process and access to intermediate results, so we rewrote the training loop like this:
```python
optimizer = tf.keras.optimizers.RMSprop(lr=base_learning_rate)
train_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(name='train_accuracy')

@tf.function
def train_cls_step(image, label):
    with tf.GradientTape() as tape:
        predictions = model(image)
        loss = tf.keras.losses.SparseCategoricalCrossentropy()(label, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_accuracy(label, predictions)

for images, labels in train_batches:
    train_cls_step(images, labels)
```
After retraining, the result is still perfect!
But at this point we wanted to compare fine-tuning against training from scratch, so we changed the model-building code to this:
```python
base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights=None)
```
With the weights now randomly initialized, training falls apart: the loss does not decrease and accuracy stays pinned around 50%:
```
Step #10: loss=0.6937199831008911 acc=46.5625%
Step #20: loss=0.6932525634765625 acc=47.8125%
Step #30: loss=0.699873685836792 acc=49.16666793823242%
Step #40: loss=0.6910845041275024 acc=49.6875%
Step #50: loss=0.6935917139053345 acc=50.0625%
Step #60: loss=0.6965731382369995 acc=49.6875%
Step #70: loss=0.6949992179870605 acc=49.19642639160156%
Step #80: loss=0.6942993402481079 acc=49.84375%
Step #90: loss=0.6933775544166565 acc=49.65277862548828%
Step #100: loss=0.6928421258926392 acc=49.5%
Step #110: loss=0.6883170008659363 acc=49.54545593261719%
Step #120: loss=0.695658802986145 acc=49.453125%
Step #130: loss=0.6875559091567993 acc=49.61538314819336%
Step #140: loss=0.6851695775985718 acc=49.86606979370117%
Step #150: loss=0.6978713274002075 acc=49.875%
Step #160: loss=0.7165156602859497 acc=50.0%
Step #170: loss=0.6945627331733704 acc=49.797794342041016%
Step #180: loss=0.6936900615692139 acc=49.9305534362793%
Step #190: loss=0.6938323974609375 acc=49.83552551269531%
Step #200: loss=0.7030564546585083 acc=49.828125%
Step #210: loss=0.6926192045211792 acc=49.76190185546875%
Step #220: loss=0.6932414770126343 acc=49.786930084228516%
Step #230: loss=0.6924526691436768 acc=49.82337188720703%
Step #240: loss=0.6882281303405762 acc=49.869789123535156%
Step #250: loss=0.6877702474594116 acc=49.86249923706055%
Step #260: loss=0.6933954954147339 acc=49.77163314819336%
Step #270: loss=0.6944763660430908 acc=49.75694274902344%
Step #280: loss=0.6945018768310547 acc=49.49776840209961%
```
We printed the predictions and found that each output in the batch was the same:
```
0 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
1 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
2 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
3 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
4 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
5 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
6 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
7 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
8 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
9 = tf.Tensor([0.51352817 0.48647183], shape=(2,), dtype=float32)
```
All we changed was the initial weights; why does this happen?
Problem investigation
Experiment 1
Is the model under-trained, or is the learning rate badly chosen?
After several rounds of adjustment we found that no matter how long we trained, and whether we made the learning rate larger or smaller, the result did not change.
Experiment 2
Since it looks like a weight problem, is the random weight initialization itself at fault? We pulled out the initial weights and computed statistics on them; everything was normal.
Experiment 3
From past experience, a BatchNormalization layer that is handled incorrectly when exporting an inference model can produce identical outputs within a batch. But how would that explain what happens during training? And why does fine-tuning not have the problem, when all we changed was the initial weights?
Googling in this direction shows that Keras has indeed had quite a few issues around BatchNormalization. One of them is that the moving mean and moving variance of BatchNormalization are not saved when the model is saved ([6] https://github.com/tensorflow/tensorflow/issues/16455), and two other issues are related to our question:
[2] https://github.com/tensorflow/tensorflow/issues/19643
[3] https://github.com/tensorflow/tensorflow/issues/23873
Finally, we found that the author of the following post had identified the cause and summarized it:
[4] https://pgaleone.eu/tensorflow/keras/2019/01/19/keras-not-yet-interface-to-tensorflow/
With this hint, we ran the following experiments.
Experiment 3.1
Training from scratch again, but with model.fit(): in the first few epochs, training accuracy starts to climb slowly, but validation accuracy is clearly broken. And if we fetch intermediate results through model.predict_on_batch(), the outputs within a batch are still identical.
```
Epoch 1/20
581/581 [==============================] - 162s 279ms/step - loss: 0.6768 - sparse_categorical_accuracy: 0.6224 - val_loss: 0.6981 - val_sparse_categorical_accuracy: 0.4984
Epoch 2/20
581/581 [==============================] - 133s 228ms/step - loss: 0.4847 - sparse_categorical_accuracy: 0.7684 - val_loss: 0.6931 - val_sparse_categorical_accuracy: 0.5016
Epoch 3/20
581/581 [==============================] - 130s 223ms/step - loss: 0.3905 - sparse_categorical_accuracy: 0.8250 - val_loss: 0.6996 - val_sparse_categorical_accuracy: 0.4984
Epoch 4/20
581/581 [==============================] - 131s 225ms/step - loss: 0.3113 - sparse_categorical_accuracy: 0.8660 - val_loss: 0.6935 - val_sparse_categorical_accuracy: 0.5016
```
However, as training progressed, the results turned around and became normal (the tf.function version, by contrast, stayed broken no matter how long it trained, but fortunately we did not give up on it). (Addendum: there is actually still a problem here, keep reading; even at the time it seemed strange that it converged so slowly.)
```
Epoch 18/20
581/581 [==============================] - 131s 226ms/step - loss: 0.0731 - sparse_categorical_accuracy: 0.9725 - val_loss: 1.4896 - val_sparse_categorical_accuracy: 0.8703
Epoch 19/20
581/581 [==============================] - 130s 225ms/step - loss: 0.0664 - sparse_categorical_accuracy: 0.9748 - val_loss: 0.6890 - val_sparse_categorical_accuracy: 0.9016
Epoch 20/20
581/581 [==============================] - 126s 217ms/step - loss: 0.0631 - sparse_categorical_accuracy: 0.9768 - val_loss: 1.0290 - val_sparse_categorical_accuracy: 0.9031
```
At this point model.predict_on_batch() also returns results consistent with this accuracy.
Experiment 3.2
The previous experiment verifies that training purely through the Keras API works fine. But what is the deeper reason? Is it that BatchNormalization is not updating its moving mean and moving variance? The answer is yes.
For both training methods, we printed the moving mean and moving variance before and after training:
```python
def get_bn_vars(collection):
    moving_mean, moving_variance = None, None
    for var in collection:
        name = var.name.lower()
        if "variance" in name:
            moving_variance = var
        if "mean" in name:
            moving_mean = var
        if moving_mean is not None and moving_variance is not None:
            return moving_mean, moving_variance
    raise ValueError("Unable to find moving mean and variance")

mean, variance = get_bn_vars(model.variables)
print(mean)
print(variance)
```
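For the tf.function path, a per-step check along these lines (a sketch using the names defined above, not the article's exact script) makes the behaviour easy to reproduce:

```python
# Snapshot the moving statistics, run one @tf.function training step, and compare.
mean, variance = get_bn_vars(model.variables)
mean_before = mean.numpy().copy()

images, labels = next(iter(train_batches))
train_cls_step(images, labels)

print("moving_mean changed:", bool((mean.numpy() != mean_before).any()))
```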
We found that, indeed, when training with model.fit() the mean and variance are updated (although the rate of update looks strange), while in the tf.function version the two values are never updated.
This also explains why fine-tuning is not affected: the mean and variance trained on ImageNet are already good values and work even without being updated.
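To see concretely why the pretrained statistics are "good enough", here is a small illustration we added (not part of the original experiments; the 160x160 input matches the IMG_SIZE used above) comparing the first BatchNormalization layer of a pretrained backbone with a freshly initialized one:

```python
import tensorflow as tf

pretrained = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                               include_top=False, weights='imagenet')
fresh = tf.keras.applications.MobileNetV2(input_shape=(160, 160, 3),
                                          include_top=False, weights=None)

# Pick the first layer that carries moving statistics (i.e. a BatchNormalization layer).
bn_pre = next(l for l in pretrained.layers if hasattr(l, 'moving_mean'))
bn_new = next(l for l in fresh.layers if hasattr(l, 'moving_mean'))

# The pretrained layer holds non-trivial statistics learned on ImageNet; the fresh
# layer sits at the defaults (moving_mean = 0, moving_variance = 1), and in our
# tf.function loop those defaults never move.
print(bn_pre.moving_mean.numpy()[:5], bn_pre.moving_variance.numpy()[:5])
print(bn_new.moving_mean.numpy()[:5], bn_new.moving_variance.numpy()[:5])
```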
Experiment 3.3
Following the approach described in [4], is it enough to build the model with a dynamic input shape?
```python
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Conv2D, BatchNormalization, Flatten, Dense

class MyModel(Model):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = Conv2D(32, 3, activation='relu')
        self.batch_norm1 = BatchNormalization()
        self.flatten = Flatten()
        self.d1 = Dense(128, activation='relu')
        self.d2 = Dense(10, activation='softmax')

    def call(self, x):
        x = self.conv1(x)
        x = self.batch_norm1(x)
        x = self.flatten(x)
        x = self.d1(x)
        return self.d2(x)

model = MyModel()
# model.build((None, 28, 28, 1))
model.summary()

# loss_object, optimizer, train_loss and train_accuracy are defined as in the
# standard TF 2.0 MNIST quickstart.
@tf.function
def train_step(image, label):
    with tf.GradientTape() as tape:
        predictions = model(image)
        loss = loss_object(label, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss(loss)
    train_accuracy(label, predictions)
```
The model is as follows:
Model: "my_model" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= conv2d (Conv2D) multiple 320 _________________________________________________________________ batch_normalization_v2 (Batc multiple 128 _________________________________________________________________ flatten (Flatten) multiple 0 _________________________________________________________________ dense (Dense) multiple 2769024 _________________________________________________________________ dense_1 (Dense) multiple 1290 ================================================================= Total params: 2,770,762 Trainable params: 2,770,698 Non-trainable params: 64
Judging from the Output Shape column, the model is built without problems.
We ran it on MNIST once and the result looked fine!
Just in case, we also checked whether the mean and variance were being updated, and the result was unexpected: they were not!
In other words, the scheme described in [4] does not solve our problem here.
Experiment 3.4
Since we have located the problem in BatchNormalization, and BatchNormalization behaves differently in training and in testing (the moving mean and variance must not be updated at test time), could it be that tf.function does not switch this state automatically?
Looking at the source code, we found that BatchNormalization's call() takes a training argument, and its default is False:
```
Call arguments:
  inputs: Input tensor (of any rank).
  training: Python boolean indicating whether the layer should behave in
    training mode or in inference mode.
    - `training=True`: The layer will normalize its inputs using the
      mean and variance of the current batch of inputs.
    - `training=False`: The layer will normalize its inputs using the
      mean and variance of its moving statistics, learned during training.
```
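To see what this means in isolation, here is a minimal sketch we added (not from the original article) with a standalone BatchNormalization layer: the moving statistics only change when the layer is called with training=True.

```python
import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()
x = tf.random.normal((8, 4)) * 3.0 + 5.0   # a batch with non-zero mean and variance

bn(x, training=False)
print(bn.moving_mean.numpy())   # still the initial zeros: inference mode does not update

bn(x, training=True)
print(bn.moving_mean.numpy())   # nudged toward the batch mean (momentum defaults to 0.99)
```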
So we made the following change:
```python
class MyModel(Model):
    def __init__(self):
        super(MyModel, self).__init__()
        self.conv1 = Conv2D(32, 3, activation='relu')
        self.batch_norm1 = BatchNormalization()
        self.flatten = Flatten()
        self.d1 = Dense(128, activation='relu')
        self.d2 = Dense(10, activation='softmax')

    def call(self, x, training=True):
        x = self.conv1(x)
        x = self.batch_norm1(x, training=training)
        x = self.flatten(x)
        x = self.d1(x)
        return self.d2(x)

model = MyModel()
# model.build((None, 28, 28, 1))
model.summary()

@tf.function
def train_step(image, label):
    with tf.GradientTape() as tape:
        predictions = model(image, training=True)
        loss = loss_object(label, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    train_loss(loss)
    train_accuracy(label, predictions)

@tf.function
def test_step(image, label):
    predictions = model(image, training=False)
    t_loss = loss_object(label, predictions)
    test_loss(t_loss)
    test_accuracy(label, predictions)
```
The results show that the moving mean and variance now do get updated, and the test accuracy is as expected.
So we can confirm that the root cause of the problem is whether BatchNormalization knows it is in training mode or in testing mode!
Experiment 3.5
The approach in 3.4 solves our problem, but it builds the model by subclassing, while our original MobileNetV2 classifier was built with the more flexible Keras Functional API. Because we cannot control the definition of call(), there is no way to switch flexibly between the training and testing state, and the same applies to models built with Sequential.
[5] https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html
[7] https://github.com/keras-team/keras/issues/7085
[8] https://github.com/keras-team/keras/issues/6752
From [5] and [8] we learned two things:
- tf.keras.backend.set_learning_phase() can switch between the training and testing state;
- model.updates and layer.updates hold the assign ops that move the moving statistics from their old values to their new values.
So we tried the first one:
```python
tf.keras.backend.set_learning_phase(True)
```
As a result, the model built from MobileNetV2 now also trains correctly.
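For concreteness, here is a minimal sketch (using the names defined earlier in this article; `evaluate_step` is a hypothetical evaluation counterpart to `train_cls_step`) of where the call sits relative to the custom loop, and of switching the phase back for evaluation:

```python
import tensorflow as tf

# Training phase: BatchNormalization normalizes with batch statistics and updates
# its moving mean/variance, even though the Functional/Sequential model's call()
# exposes no training argument.
tf.keras.backend.set_learning_phase(True)
for images, labels in train_batches:
    train_cls_step(images, labels)

# Evaluation phase: switch back so BatchNormalization uses the accumulated
# moving statistics instead of batch statistics.
tf.keras.backend.set_learning_phase(False)
for images, labels in validation_batches:
    evaluate_step(images, labels)   # hypothetical step that only computes predictions and metrics
```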
Moreover, convergence seems much faster than with model.fit(). Combined with the earlier puzzle of model.fit() converging slowly, we ran one more experiment: adding this line to the model.fit() version also makes it converge much faster, with one epoch already giving good results!
So here is another question: does model.fit() itself actually set the learning_phase state? And if not, how does it update the moving mean and variance?
The second approach comes from a tutorial written for version 1.x, and there seems to be no way to run these assign operations in eager execution mode, so it is only for reference:
```python
update_ops = []
for assign_op in model.updates:
    update_ops.append(assign_op)
# But we do not know what to do with these update ops in eager execution mode.
```
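For completeness, here is a reference-only sketch of the graph-mode idiom described in [5]; this is our assumption of how it would look, it requires TF 1.x-style graph execution (e.g. via tf.compat.v1) and a symbolic loss tensor, and it does not carry over to eager mode:

```python
import tensorflow.compat.v1 as tf1
tf1.disable_eager_execution()

# In graph mode, the BatchNormalization assign ops collected in model.updates
# must be run together with the training op, e.g. via control dependencies.
with tf1.control_dependencies(model.updates):
    train_op = tf1.train.RMSPropOptimizer(base_learning_rate).minimize(loss)
```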
Conclusion
To summarize: we found the inspiration for solving the problem in [4], but it turned out that the problem and solution described there cannot be applied directly here. The key issue is how Keras + TensorFlow 2.0 handles layers whose behavior differs between the training and testing states, both for model.fit() and for tf.function-based training. In the end, model.fit() still seems to hide a lot of odd behavior.
Our final recommendations are as follows:
- When training with Keras APIs such as model.fit() or model.train_on_batch(), it is still recommended to call tf.keras.backend.set_learning_phase(True) manually to speed up convergence (see the sketch after this list);
- If you train with a custom loop under eager execution, either:
- 1) build the Model by subclassing and pass an explicit training state through call(), so that layers such as BatchNormalization and Dropout behave correctly in each mode; or
- 2) build the Model with the Functional API or Sequential and call tf.keras.backend.set_learning_phase(True), remembering to switch the state back when testing.
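As a minimal sketch of the first recommendation (reusing the names from the beginning of the article), it is just a matter of setting the phase before compiling and fitting:

```python
import tensorflow as tf

# Set the learning phase explicitly even when training through model.fit(),
# as recommended above; model, base_learning_rate, train_batches, etc. are
# the objects defined earlier in this article.
tf.keras.backend.set_learning_phase(True)

model.compile(optimizer=tf.keras.optimizers.RMSprop(lr=base_learning_rate),
              loss='sparse_categorical_crossentropy',
              metrics=['sparse_categorical_accuracy'])
history = model.fit(train_batches.repeat(),
                    epochs=20,
                    steps_per_epoch=steps_per_epoch,
                    validation_data=validation_batches.repeat(),
                    validation_steps=validation_steps)
```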
Finally, why is none of this mentioned in the TF 2.0 tutorials? Is familiarity with Keras simply assumed? [facepalm]
Acknowledgements
Thanks to the teachers for their help.
[1] https://www.tensorflow.org/alpha/tutorials/images/transfer_learning?hl=zh-cn
[2] https://github.com/tensorflow/tensorflow/issues/19643
[3] https://github.com/tensorflow/tensorflow/issues/23873
[4] https://pgaleone.eu/tensorflow/keras/2019/01/19/keras-not-yet-interface-to-tensorflow/
[5] https://blog.keras.io/keras-as-a-simplified-interface-to-tensorflow-tutorial.html
[6] https://github.com/tensorflow/tensorflow/issues/16455
[7] https://github.com/keras-team/keras/issues/7085
[8] https://github.com/keras-team/keras/issues/6752