{"title": "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results", "book": "Advances in Neural Information Processing Systems", "page_first": 1195, "page_last": 1204, "abstract": "The recently proposed Temporal Ensembling has achieved state-of-the-art results in several semi-supervised learning benchmarks. It maintains an exponential moving average of label predictions on each training example, and penalizes predictions that are inconsistent with this target. However, because the targets change only once per epoch, Temporal Ensembling becomes unwieldy when learning large datasets. To overcome this problem, we propose Mean Teacher, a method that averages model weights instead of label predictions. As an additional benefit, Mean Teacher improves test accuracy and enables training with fewer labels than Temporal Ensembling. Without changing the network architecture, Mean Teacher achieves an error rate of 4.35% on SVHN with 250 labels, outperforming Temporal Ensembling trained with 1000 labels. We also show that a good network architecture is crucial to performance. Combining Mean Teacher and Residual Networks, we improve the state of the art on CIFAR-10 with 4000 labels from 10.55% to 6.28%, and on ImageNet 2012 with 10% of the labels from 35.24% to 9.11%.", "full_text": "Mean teachers are better role models:\n\nWeight-averaged consistency targets improve\n\nsemi-supervised deep learning results\n\nAntti Tarvainen\n\nThe Curious AI Company\n\ntarvaina@cai.fi\n\nHarri Valpola\n\nThe Curious AI Company\n\nharri@cai.fi\n\nAbstract\n\nThe recently proposed Temporal Ensembling has achieved state-of-the-art results in\nseveral semi-supervised learning benchmarks. It maintains an exponential moving\naverage of label predictions on each training example, and penalizes predictions\nthat are inconsistent with this target. 
However, because the targets change only once\nper epoch, Temporal Ensembling becomes unwieldy when learning large datasets.\nTo overcome this problem, we propose Mean Teacher, a method that averages\nmodel weights instead of label predictions. As an additional benefit, Mean Teacher\nimproves test accuracy and enables training with fewer labels than Temporal\nEnsembling. Without changing the network architecture, Mean Teacher achieves an\nerror rate of 4.35% on SVHN with 250 labels, outperforming Temporal Ensembling\ntrained with 1000 labels. We also show that a good network architecture is crucial\nto performance. Combining Mean Teacher and Residual Networks, we improve\nthe state of the art on CIFAR-10 with 4000 labels from 10.55% to 6.28%, and on\nImageNet 2012 with 10% of the labels from 35.24% to 9.11%.\n\n1 Introduction\n\nDeep learning has seen tremendous success in areas such as image and speech recognition. In order\nto learn useful abstractions, deep learning models require a large number of parameters, thus making\nthem prone to over-fitting (Figure 1a). Moreover, adding high-quality labels to training data manually\nis often expensive. Therefore, it is desirable to use regularization methods that exploit unlabeled data\neffectively to reduce over-fitting in semi-supervised learning.\nWhen a percept is changed slightly, a human typically still considers it to be the same object.\nCorrespondingly, a classification model should favor functions that give consistent output for similar data\npoints. One approach for achieving this is to add noise to the input of the model. To enable the model\nto learn more abstract invariances, the noise may be added to intermediate representations, an insight\nthat has motivated many regularization techniques, such as Dropout [27]. 
Rather than minimizing\nthe classification cost at the zero-dimensional data points of the input space, the regularized model\nminimizes the cost on a manifold around each data point, thus pushing decision boundaries away\nfrom the labeled data points (Figure 1b).\nSince the classification cost is undefined for unlabeled examples, the noise regularization by itself\ndoes not aid in semi-supervised learning. To overcome this, the \u0393 model [20] evaluates each data\npoint with and without noise, and then applies a consistency cost between the two predictions. In this\ncase, the model assumes a dual role as a teacher and a student. As a student, it learns as before; as a\nteacher, it generates targets, which are then used by itself as a student for learning. Since the model\nitself generates targets, they may very well be incorrect. If too much weight is given to the generated\ntargets, the cost of inconsistency outweighs that of misclassification, preventing the learning of new\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: A sketch of a binary classification task with two labeled examples (large blue dots) and\none unlabeled example, demonstrating how the choice of the unlabeled target (black circle) affects\nthe fitted function (gray curve). (a) A model with no regularization is free to fit any function that\npredicts the labeled training examples well. (b) A model trained with noisy labeled data (small dots)\nlearns to give consistent predictions around labeled data points. (c) Consistency to noise around\nunlabeled examples provides additional smoothing. For clarity of illustration, the teacher model\n(gray curve) is first fitted to the labeled examples, and then left unchanged during the training of the\nstudent model. Also for clarity, we will omit the small dots in figures d and e. 
(d) Noise on the teacher\nmodel reduces the bias of the targets without additional training. The expected direction of stochastic\ngradient descent is towards the mean (large blue circle) of individual noisy targets (small blue circles).\n(e) An ensemble of models gives an even better expected target. Both Temporal Ensembling and the\nMean Teacher method use this approach.\n\ninformation. In effect, the model suffers from confirmation bias (Figure 1c), a hazard that can be\nmitigated by improving the quality of targets.\nThere are at least two ways to improve the target quality. One approach is to choose the perturbation\nof the representations carefully instead of merely applying additive or multiplicative noise. Another\napproach is to choose the teacher model carefully instead of merely replicating the student model.\nConcurrently with our research, Miyato et al. [15] have taken the first approach and shown that Virtual\nAdversarial Training can yield impressive results. We take the second approach and will show that it\ntoo provides significant benefits. To our understanding, these two approaches are compatible, and\ntheir combination may produce even better outcomes. However, the analysis of their combined effects\nis outside the scope of this paper.\nOur goal, then, is to form a better teacher model from the student model without additional training.\nAs the first step, consider that the softmax output of a model does not usually provide accurate\npredictions outside training data. This can be partly alleviated by adding noise to the model at\ninference time [4], and consequently a noisy teacher can yield more accurate targets (Figure 1d). This\napproach was used in Pseudo-Ensemble Agreement [2] and has lately been shown to work well on\nsemi-supervised image classification [13, 22]. 
Laine & Aila [13] named the method the \u03a0 model; we\nwill use this name for it, and their version of it as the basis of our experiments.\nThe \u03a0 model can be further improved by Temporal Ensembling [13], which maintains an exponential\nmoving average (EMA) prediction for each of the training examples. At each training step, all\nthe EMA predictions of the examples in that minibatch are updated based on the new predictions.\nConsequently, the EMA prediction of each example is formed by an ensemble of the model\u2019s current\nversion and those earlier versions that evaluated the same example. This ensembling improves the\nquality of the predictions, and using them as the teacher predictions improves results. However, since\neach target is updated only once per epoch, the learned information is incorporated into the training\nprocess at a slow pace. The larger the dataset, the longer the span of the updates, and in the case of\non-line learning, it is unclear how Temporal Ensembling can be used at all. (One could evaluate all\nthe targets periodically more than once per epoch, but keeping the evaluation span constant would\nrequire O(n^2) evaluations per epoch where n is the number of training examples.)\n\n2 Mean Teacher\n\nTo overcome the limitations of Temporal Ensembling, we propose averaging model weights instead\nof predictions. Since the teacher model is an average of consecutive student models, we call this the\nMean Teacher method (Figure 2). Averaging model weights over training steps tends to produce a\n\nFigure 2: The Mean Teacher method. The figure depicts a training batch with a single labeled\nexample. 
Both the student and the teacher model evaluate the input applying noise (\u03b7, \u03b7\u2032) within\ntheir computation. The softmax output of the student model is compared with the one-hot label\nusing classification cost and with the teacher output using consistency cost. After the weights of the\nstudent model have been updated with gradient descent, the teacher model weights are updated as an\nexponential moving average of the student weights. Both model outputs can be used for prediction,\nbut at the end of the training the teacher prediction is more likely to be correct. A training step with\nan unlabeled example would be similar, except no classification cost would be applied.\n\nmore accurate model than using the final weights directly [18]. We can take advantage of this during\ntraining to construct better targets. Instead of sharing the weights with the student model, the teacher\nmodel uses the EMA weights of the student model. Now it can aggregate information after every\nstep instead of every epoch. In addition, since the weight averages improve all layer outputs, not just\nthe top output, the target model has better intermediate representations. These aspects lead to two\npractical advantages over Temporal Ensembling: First, the more accurate target labels lead to a faster\nfeedback loop between the student and the teacher models, resulting in better test accuracy. Second,\nthe approach scales to large datasets and on-line learning.\nMore formally, we define the consistency cost J as the expected distance between the prediction of\nthe student model (with weights \u03b8 and noise \u03b7) and the prediction of the teacher model (with weights\n\u03b8\u2032 and noise \u03b7\u2032).\n\n$$J(\\theta) = \\mathbb{E}_{x, \\eta', \\eta} \\left[ \\left\\| f(x, \\theta', \\eta') - f(x, \\theta, \\eta) \\right\\|^2 \\right]$$\n\nThe difference between the \u03a0 model, Temporal Ensembling, and Mean Teacher is how the teacher\npredictions are generated. 
Whereas the \u03a0 model uses \u03b8\u2032 = \u03b8, and Temporal Ensembling approximates\nf(x, \u03b8\u2032, \u03b7\u2032) with a weighted average of successive predictions, we define \u03b8\u2032_t at training step t as the\nEMA of successive \u03b8 weights:\n\n$$\\theta'_t = \\alpha \\theta'_{t-1} + (1 - \\alpha) \\theta_t$$\n\nwhere \u03b1 is a smoothing coefficient hyperparameter. An additional difference between the three\nalgorithms is that the \u03a0 model applies training to \u03b8\u2032 whereas Temporal Ensembling and Mean Teacher\ntreat it as a constant with regard to optimization.\nWe can approximate the consistency cost function J by sampling noise \u03b7, \u03b7\u2032 at each training step\nwith stochastic gradient descent. Following Laine & Aila [13], we use mean squared error (MSE) as\nthe consistency cost in most of our experiments.\n\nTable 1: Error rate percentage on SVHN over 10 runs (4 runs when using all labels). We use\nexponential moving average weights in the evaluation of all our models. All the methods use a similar\n13-layer ConvNet architecture. 
See Table 5 in the Appendix for results without input augmentation.\n\nMethod | 250 labels, 73257 images | 500 labels, 73257 images | 1000 labels, 73257 images | 73257 labels, 73257 images\nGAN [24] | - | 18.44 \u00b1 4.8 | 8.11 \u00b1 1.3 | -\n\u03a0 model [13] | - | 6.65 \u00b1 0.53 | 4.82 \u00b1 0.17 | 2.54 \u00b1 0.04\nTemporal Ensembling [13] | - | 5.12 \u00b1 0.13 | 4.42 \u00b1 0.16 | 2.74 \u00b1 0.06\nVAT+EntMin [15] | - | - | 3.86 | -\nSupervised-only | 27.77 \u00b1 3.18 | 16.88 \u00b1 1.30 | 12.32 \u00b1 0.95 | 2.75 \u00b1 0.10\n\u03a0 model | 9.69 \u00b1 0.92 | 6.83 \u00b1 0.66 | 4.95 \u00b1 0.26 | 2.50 \u00b1 0.07\nMean Teacher | 4.35 \u00b1 0.50 | 4.18 \u00b1 0.27 | 3.95 \u00b1 0.19 | 2.50 \u00b1 0.05\n\nTable 2: Error rate percentage on CIFAR-10 over 10 runs (4 runs when using all labels).\n\nMethod | 1000 labels, 50000 images | 2000 labels, 50000 images | 4000 labels, 50000 images | 50000 labels, 50000 images\nGAN [24] | - | - | 18.63 \u00b1 2.32 | -\n\u03a0 model [13] | - | - | 12.36 \u00b1 0.31 | 5.56 \u00b1 0.10\nTemporal Ensembling [13] | - | - | 12.16 \u00b1 0.31 | 5.60 \u00b1 0.10\nVAT+EntMin [15] | - | - | 10.55 | -\nSupervised-only | 46.43 \u00b1 1.21 | 33.94 \u00b1 0.73 | 20.66 \u00b1 0.57 | 5.82 \u00b1 0.15\n\u03a0 model | 27.36 \u00b1 1.20 | 18.02 \u00b1 0.60 | 13.20 \u00b1 0.27 | 6.06 \u00b1 0.11\nMean Teacher | 21.55 \u00b1 1.48 | 15.73 \u00b1 0.31 | 12.31 \u00b1 0.28 | 5.94 \u00b1 0.15\n\n3 Experiments\n\nTo test our hypotheses, we first replicated the \u03a0 model [13] in TensorFlow [1] as our baseline. We\nthen modified the baseline model to use weight-averaged consistency targets. 
The model architecture\nis a 13-layer convolutional neural network (ConvNet) with three types of noise: random translations\nand horizontal flips of the input images, Gaussian noise on the input layer, and dropout applied within\nthe network. We use mean squared error as the consistency cost and ramp up its weight from 0 to\nits final value during the first 80 epochs. The details of the model and the training procedure are\ndescribed in Appendix B.1.\n\n3.1 Comparison to other methods on SVHN and CIFAR-10\n\nWe ran experiments using the Street View House Numbers (SVHN) and CIFAR-10 benchmarks [16].\nBoth datasets contain 32x32 pixel RGB images belonging to ten different classes. In SVHN, each\nexample is a close-up of a house number, and the class represents the identity of the digit at the center\nof the image. In CIFAR-10, each example is a natural image belonging to a class such as horses, cats,\ncars and airplanes. SVHN consists of 73257 training samples and 26032 test samples. CIFAR-10\nconsists of 50000 training samples and 10000 test samples.\nTables 1 and 2 compare the results against recent state-of-the-art methods. All the methods in the\ncomparison use a similar 13-layer ConvNet architecture. Mean Teacher improves test accuracy\nover the \u03a0 model and Temporal Ensembling on semi-supervised SVHN tasks. Mean Teacher also\nimproves results on CIFAR-10 over our baseline \u03a0 model.\nThe recently published version of Virtual Adversarial Training by Miyato et al. [15] performs even\nbetter than Mean Teacher on the 1000-label SVHN and the 4000-label CIFAR-10. As discussed in the\nintroduction, VAT and Mean Teacher are complementary approaches. 
Their combination may yield\nbetter accuracy than either of them alone, but that investigation is beyond the scope of this paper.\n\nTable 3: Error percentage over 10 runs on SVHN with extra unlabeled training data.\n\nMethod | 500 labels, 73257 images | 500 labels, 173257 images | 500 labels, 573257 images\n\u03a0 model (ours) | 6.83 \u00b1 0.66 | 4.49 \u00b1 0.27 | 3.26 \u00b1 0.14\nMean Teacher | 4.18 \u00b1 0.27 | 3.02 \u00b1 0.16 | 2.46 \u00b1 0.06\n\nFigure 3: Smoothed classification cost (top) and classification error (bottom) of Mean Teacher and\nour baseline \u03a0 model on SVHN over the first 100000 training steps. In the upper row, the training\nclassification costs are measured using only labeled data.\n\n3.2 SVHN with extra unlabeled data\n\nAbove, we suggested that Mean Teacher scales well to large datasets and on-line learning. In addition,\nthe SVHN and CIFAR-10 results indicate that it uses unlabeled examples efficiently. Therefore, we\nwanted to test whether we have reached the limits of our approach.\nBesides the primary training data, SVHN also includes an extra dataset of 531131 examples. We\npicked 500 samples from the primary training set as our labeled training examples. We used the rest of\nthe primary training set together with the extra training set as unlabeled examples. We ran experiments\nwith Mean Teacher and our baseline \u03a0 model, and used either 0, 100000 or 500000 extra examples.\nTable 3 shows the results.\n\n3.3 Analysis of the training curves\n\nThe training curves in Figure 3 help us understand the effects of using Mean Teacher. 
As expected, the\nEMA-weighted models (blue and dark gray curves in the bottom row) give more accurate predictions\nthan the bare student models (orange and light gray) after an initial period.\nUsing the EMA-weighted model as the teacher improves results in the semi-supervised settings.\nThere appears to be a virtuous feedback cycle of the teacher (blue curve) improving the student\n(orange) via the consistency cost, and the student improving the teacher via exponential moving\naveraging. If this feedback cycle is detached, the learning is slower, and the model starts to overfit\nearlier (dark gray and light gray).\nMean Teacher helps when labels are scarce. When using 500 labels (middle column), Mean Teacher\nlearns faster, and continues training after the \u03a0 model stops improving. On the other hand, in the\nall-labeled case (left column), Mean Teacher and the \u03a0 model behave virtually identically.\n\nFigure 4: Validation error on 250-label SVHN over four runs per hyperparameter setting and\ntheir means. In each experiment, we varied one hyperparameter, and used the evaluation run\nhyperparameters of Table 1 for the rest. The hyperparameter settings used in the evaluation runs are\nmarked in bold. See the text for details.\n\nMean Teacher uses unlabeled training data more efficiently than the \u03a0 model, as seen in the middle\ncolumn. On the other hand, with 500k extra unlabeled examples (right column), the \u03a0 model keeps\nimproving for longer. Mean Teacher learns faster, and eventually converges to a better result, but the\nsheer amount of data appears to offset the \u03a0 model\u2019s worse predictions.\n\n3.4 Ablation experiments\n\nTo assess the importance of various aspects of the model, we ran experiments on SVHN with 250\nlabels, varying one or a few hyperparameters at a time while keeping the others fixed.\nRemoval of noise (Figures 4(a) and 4(b)). 
In the introduction and Figure 1, we presented the\nhypothesis that the \u03a0 model produces better predictions by adding noise to the model on both sides.\nBut after the addition of Mean Teacher, is noise still needed? Yes. We can see that either input\naugmentation or dropout is necessary for passable performance. On the other hand, input noise does\nnot help when augmentation is in use. Dropout on the teacher side provides only a marginal benefit\nover just having it on the student side, at least when input augmentation is in use.\nSensitivity to EMA decay and consistency weight (Figures 4(c) and 4(d)). The essential\nhyperparameters of the Mean Teacher algorithm are the consistency cost weight and the EMA decay \u03b1. How\nsensitive is the algorithm to their values? We can see that in each case the good values span roughly\nan order of magnitude and outside these ranges the performance degrades quickly. Note that EMA\ndecay \u03b1 = 0 makes the model a variation of the \u03a0 model, although a somewhat inefficient one because\nthe gradients are propagated through only the student path. Note also that in the evaluation runs we\nused EMA decay \u03b1 = 0.99 during the ramp-up phase, and \u03b1 = 0.999 for the rest of the training. We\nchose this strategy because the student improves quickly early in the training, and thus the teacher\nshould forget the old, inaccurate, student weights quickly. Later the student improvement slows, and\nthe teacher benefits from a longer memory.\nDecoupling classification and consistency (Figure 4(e)). The consistency to teacher predictions\nmay not necessarily be a good proxy for the classification task, especially early in the training. So\nfar our model has strongly coupled these two tasks by using the same output for both. How would\ndecoupling the tasks change the performance of the algorithm? To investigate, we changed the model\nto have two top layers and produce two outputs. 
We then trained one of the outputs for classification\nand the other for consistency. We also added a mean squared error cost between the output logits, and\nthen varied the weight of this cost, allowing us to control the strength of the coupling. Looking at the\nresults (reported using the EMA version of the classification output), we can see that the strongly\ncoupled version performs well and the too loosely coupled versions do not. On the other hand, a\nmoderate decoupling seems to have the benefit of making the consistency ramp-up redundant.\n\nTable 4: Error rate percentage of ResNet Mean Teacher compared to the state of the art. We report\nthe test results from 10 runs on CIFAR-10 and validation results from 2 runs on ImageNet.\n\nMethod | CIFAR-10, 4000 labels | ImageNet 2012, 10% of the labels\nState of the art | 10.55 [15] | 35.24 \u00b1 0.90 [19]\nConvNet Mean Teacher | 12.31 \u00b1 0.28 | -\nResNet Mean Teacher | 6.28 \u00b1 0.15 | 9.11 \u00b1 0.12\nState of the art using all labels | 2.86 [5] | 3.79 [10]\n\nChanging from MSE to KL-divergence (Figure 4(f)). Following Laine & Aila [13], we use mean\nsquared error (MSE) as our consistency cost function, but KL-divergence would seem a more natural\nchoice. Which one works better? We ran experiments with instances of a cost function family ranging\nfrom MSE (\u03c4 = 0 in the figure) to KL-divergence (\u03c4 = 1), and found that in this setting MSE\nperforms better than the other cost functions. See Appendix C for the details of the cost function\nfamily and for our intuition about why MSE performs so well.\n\n3.5 Mean Teacher with residual networks on CIFAR-10 and ImageNet\n\nIn the experiments above, we used a traditional 13-layer convolutional architecture (ConvNet), which\nhas the benefit of making comparisons to earlier work easy. 
In order to explore the effect of the model\narchitecture, we ran experiments using a 12-block (26-layer) Residual Network [8] (ResNet) with\nShake-Shake regularization [5] on CIFAR-10. The details of the model and the training procedure\nare described in Appendix B.2. As shown in Table 4, the results improve remarkably with the better\nnetwork architecture.\nTo test whether the method scales to more natural images, we ran experiments on the ImageNet 2012\ndataset [21] using 10% of the labels. We used a 50-block (152-layer) ResNeXt architecture [32],\nand saw a clear improvement over the state of the art. As the test set is not publicly available, we\nmeasured the results using the validation set.\n\n4 Related work\n\nNoise regularization of neural networks was proposed by Sietsma & Dow [25]. More recently, several\ntypes of perturbations have been shown to regularize intermediate representations effectively in\ndeep learning. Adversarial Training [6] changes the input slightly to give predictions that are as\ndifferent as possible from the original predictions. Dropout [27] zeroes random dimensions of layer\noutputs. DropConnect [30] generalizes Dropout by zeroing individual weights instead of activations.\nStochastic Depth [11] drops entire layers of residual networks, and Swapout [26] generalizes Dropout\nand Stochastic Depth. Shake-Shake regularization [5] duplicates residual paths and samples a linear\ncombination of their outputs independently during forward and backward passes.\nSeveral semi-supervised methods are based on training the model predictions to be consistent under\nperturbation. The Denoising Source Separation framework (DSS) [28] uses denoising of latent\nvariables to learn their likelihood estimate. The \u0393 variant of Ladder Network [20] implements DSS\nwith a deep learning model for classification tasks. 
It produces noisy student predictions and clean\nteacher predictions, and applies a denoising layer to predict teacher predictions from the student\npredictions. The \u03a0 model [13] improves the \u0393 model by removing the explicit denoising layer and\napplying noise also to the teacher predictions. Similar methods had been proposed earlier for\nlinear models [29] and deep learning [2]. Virtual Adversarial Training [15] is similar to the \u03a0 model\nbut uses adversarial perturbation instead of independent noise.\nThe idea of a teacher model training a student is related to model compression [3] and distillation [9].\nThe knowledge of a complicated model can be transferred to a simpler model by training the\nsimpler model with the softmax outputs of the complicated model. The softmax outputs contain\nmore information about the task than the one-hot outputs, and the requirement of representing this\nknowledge regularizes the simpler model. Besides its use in model compression, distillation can be\nused to harden trained models against adversarial attacks [17]. The difference between distillation\nand consistency regularization is that distillation is performed after training whereas consistency\nregularization is performed at training time.\nConsistency regularization can be seen as a form of label propagation [33]. Training samples that\nresemble each other are more likely to belong to the same class. Label propagation takes advantage\nof this assumption by pushing label information from each example to examples that are near it\naccording to some metric. Label propagation can also be applied to deep learning models [31].\nHowever, ordinary label propagation requires a predefined distance metric in the input space. In\ncontrast, consistency targets employ a learned distance metric implied by the abstract representations\nof the model. As the model learns new features, the distance metric changes to accommodate these\nfeatures. 
Therefore, consistency targets guide learning in two ways. On the one hand, they spread the\nlabels according to the current distance metric, and on the other hand, they help the network learn a\nbetter distance metric.\n\n5 Conclusion\n\nTemporal Ensembling, Virtual Adversarial Training and other forms of consistency regularization\nhave recently shown their strength in semi-supervised learning. In this paper, we propose Mean\nTeacher, a method that averages model weights to form a target-generating teacher model. Unlike\nTemporal Ensembling, Mean Teacher works with large datasets and on-line learning. Our experiments\nsuggest that it improves the speed of learning and the classification accuracy of the trained network.\nIn addition, it scales well to state-of-the-art architectures and large image sizes.\nThe success of consistency regularization depends on the quality of teacher-generated targets. If the\ntargets can be improved, they should be. Mean Teacher and Virtual Adversarial Training represent\ntwo ways of exploiting this principle. Their combination may yield even better targets. There are\nprobably additional methods to be uncovered that improve targets and trained models even further.\n\nAcknowledgements\n\nWe thank Samuli Laine and Timo Aila for fruitful discussions about their work, and Phil Bachman\nand Colin Raffel for corrections to the pre-print version of this paper. 
We also thank everyone at The\nCurious AI Company for their help, encouragement, and ideas.\n\nReferences\n[1] Abadi, Mart\u00edn, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig,\nCorrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Goodfellow,\nIan, Harp, Andrew, Irving, Geoffrey, Isard, Michael, Jia, Yangqing, Jozefowicz, Rafal, Kaiser,\nLukasz, Kudlur, Manjunath, Levenberg, Josh, Man\u00e9, Dan, Monga, Rajat, Moore, Sherry,\nMurray, Derek, Olah, Chris, Schuster, Mike, Shlens, Jonathon, Steiner, Benoit, Sutskever, Ilya,\nTalwar, Kunal, Tucker, Paul, Vanhoucke, Vincent, Vasudevan, Vijay, Vi\u00e9gas, Fernanda, Vinyals,\nOriol, Warden, Pete, Wattenberg, Martin, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang.\nTensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015.\n\n[2] Bachman, Philip, Alsharif, Ouais, and Precup, Doina. Learning with Pseudo-Ensembles.\narXiv:1412.4864 [cs, stat], December 2014. arXiv: 1412.4864.\n\n[3] Bucilu\u01ce, Cristian, Caruana, Rich, and Niculescu-Mizil, Alexandru. Model compression. In\nProceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and\ndata mining, pp. 535\u2013541. ACM, 2006.\n\n[4] Gal, Yarin and Ghahramani, Zoubin. Dropout as a Bayesian Approximation: Representing\nModel Uncertainty in Deep Learning. In Proceedings of The 33rd International Conference on\nMachine Learning, pp. 1050\u20131059, 2016.\n\n[5] Gastaldi, Xavier. Shake-Shake regularization. arXiv:1705.07485 [cs], May 2017. arXiv:\n1705.07485.\n\n[6] Goodfellow, Ian J., Shlens, Jonathon, and Szegedy, Christian. Explaining and Harnessing\nAdversarial Examples. December 2014. arXiv: 1412.6572.\n\n[7] Guo, Chuan, Pleiss, Geoff, Sun, Yu, and Weinberger, Kilian Q. On Calibration of Modern\nNeural Networks. arXiv:1706.04599 [cs], June 2017. arXiv: 1706.04599.\n\n[8] He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. 
Deep Residual Learning for\nImage Recognition. arXiv:1512.03385 [cs], December 2015. arXiv: 1512.03385.\n\n[9] Hinton, Geoffrey, Vinyals, Oriol, and Dean, Jeff. Distilling the Knowledge in a Neural Network.\narXiv:1503.02531 [cs, stat], March 2015. arXiv: 1503.02531.\n\n[10] Hu, Jie, Shen, Li, and Sun, Gang. Squeeze-and-Excitation Networks. arXiv:1709.01507 [cs],\nSeptember 2017. arXiv: 1709.01507.\n\n[11] Huang, Gao, Sun, Yu, Liu, Zhuang, Sedra, Daniel, and Weinberger, Kilian. Deep Networks\nwith Stochastic Depth. arXiv:1603.09382 [cs], March 2016. arXiv: 1603.09382.\n\n[12] Kingma, Diederik and Ba, Jimmy. Adam: A Method for Stochastic Optimization.\narXiv:1412.6980 [cs], December 2014. arXiv: 1412.6980.\n\n[13] Laine, Samuli and Aila, Timo. Temporal Ensembling for Semi-Supervised Learning.\narXiv:1610.02242 [cs], October 2016. arXiv: 1610.02242.\n\n[14] Maas, Andrew L., Hannun, Awni Y., and Ng, Andrew Y. Rectifier nonlinearities improve neural\nnetwork acoustic models. In Proc. ICML, volume 30, 2013.\n\n[15] Miyato, Takeru, Maeda, Shin-ichi, Koyama, Masanori, and Ishii, Shin. Virtual Adversarial\nTraining: a Regularization Method for Supervised and Semi-supervised Learning. arXiv:1704.03976\n[cs, stat], April 2017. arXiv: 1704.03976.\n\n[16] Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, and Ng, Andrew Y.\nReading digits in natural images with unsupervised feature learning. In NIPS Workshop on\nDeep Learning and Unsupervised Feature Learning, 2011.\n\n[17] Papernot, Nicolas, McDaniel, Patrick, Wu, Xi, Jha, Somesh, and Swami, Ananthram. Distillation\nas a Defense to Adversarial Perturbations against Deep Neural Networks. arXiv:1511.04508\n[cs, stat], November 2015. arXiv: 1511.04508.\n\n[18] Polyak, B. T. and Juditsky, A. B. Acceleration of Stochastic Approximation by Averaging.\nSIAM J. Control Optim., 30(4):838\u2013855, July 1992. ISSN 0363-0129. 
doi: 10.1137/0330046.\n\n[19] Pu, Yunchen, Gan, Zhe, Henao, Ricardo, Yuan, Xin, Li, Chunyuan, Stevens, Andrew, and\nCarin, Lawrence. Variational Autoencoder for Deep Learning of Images, Labels and Captions.\narXiv:1609.08976 [cs, stat], September 2016. arXiv: 1609.08976.\n\n[20] Rasmus, Antti, Berglund, Mathias, Honkala, Mikko, Valpola, Harri, and Raiko, Tapani.\nSemi-supervised Learning with Ladder Networks. In Cortes, C., Lawrence, N. D., Lee, D. D.,\nSugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 28,\npp. 3546\u20133554. Curran Associates, Inc., 2015.\n\n[21] Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang,\nZhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and\nFei-Fei, Li. ImageNet Large Scale Visual Recognition Challenge. arXiv:1409.0575 [cs], September\n2014. arXiv: 1409.0575.\n\n[22] Sajjadi, Mehdi, Javanmardi, Mehran, and Tasdizen, Tolga. Regularization With Stochastic\nTransformations and Perturbations for Deep Semi-Supervised Learning. In Lee, D. D., Sugiyama, M.,\nLuxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing\nSystems 29, pp. 1163\u20131171. Curran Associates, Inc., 2016.\n\n[23] Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization\nto accelerate training of deep neural networks. In Advances in Neural Information Processing\nSystems, pp. 901\u2013901, 2016.\n\n[24] Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, and Chen,\nXi. Improved techniques for training gans. In Advances in Neural Information Processing\nSystems, pp. 2226\u20132234, 2016.\n\n[25] Sietsma, Jocelyn and Dow, Robert JF. Creating artificial neural networks that generalize. Neural\nnetworks, 4(1):67\u201379, 1991.\n\n[26] Singh, Saurabh, Hoiem, Derek, and Forsyth, David. Swapout: Learning an ensemble of deep\narchitectures. 
arXiv:1605.06465 [cs], May 2016. arXiv: 1605.06465.\n\n[27] Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov,\nRuslan. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn.\nRes., 15(1):1929\u20131958, January 2014. ISSN 1532-4435.\n\n[28] S\u00e4rel\u00e4, Jaakko and Valpola, Harri. Denoising Source Separation. Journal of Machine Learning\nResearch, 6(Mar):233\u2013272, 2005. ISSN 1533-7928.\n\n[29] Wager, Stefan, Wang, Sida, and Liang, Percy. Dropout Training as Adaptive Regularization.\narXiv:1307.1493 [cs, stat], July 2013. arXiv: 1307.1493.\n\n[30] Wan, Li, Zeiler, Matthew, Zhang, Sixin, Le Cun, Yann, and Fergus, Rob. Regularization of\nNeural Networks using DropConnect. pp. 1058\u20131066, 2013.\n\n[31] Weston, Jason, Ratle, Fr\u00e9d\u00e9ric, Mobahi, Hossein, and Collobert, Ronan. Deep learning via\nsemi-supervised embedding. In Neural Networks: Tricks of the Trade, pp. 639\u2013655. Springer,\n2012.\n\n[32] Xie, Saining, Girshick, Ross, Doll\u00e1r, Piotr, Tu, Zhuowen, and He, Kaiming. Aggregated\nResidual Transformations for Deep Neural Networks. arXiv:1611.05431 [cs], November 2016.\narXiv: 1611.05431.\n\n[33] Zhu, Xiaojin and Ghahramani, Zoubin. Learning from labeled and unlabeled data with label\npropagation. 2002.\n", "award": [], "sourceid": 789, "authors": [{"given_name": "Antti", "family_name": "Tarvainen", "institution": "The Curious AI Company"}, {"given_name": "Harri", "family_name": "Valpola", "institution": "The Curious AI Company"}]}