Recommendation system (V): Wide & Deep
Recommendation system series blogs:
- Recommendation system (I): overall overview of recommendation systems
- Recommendation system (II): the GBDT+LR model
- Recommendation system (III): Factorization Machines (FM)
- Recommendation system (IV): Field-aware Factorization Machines (FFM)
This blog mainly introduces a paper Google published at RecSys 2016. As the saying goes: if Google made it, it is probably a fine piece of work. The Wide & Deep model proposed in this paper has had a great impact on the recommendation field and inspired quite a bit of follow-up work in the next few years, such as Deep & Cross and DeepFM. The paper also sticks to the usual style of Google's papers: the idea itself is simple, and great attention is paid to engineering practice. It is quite unlike some papers from domestic Internet-company teams, which are just a pile of concepts, inventing a concept where there is none, dazzling the reader, and obviously written only to get published. This blog introduces the Wide & Deep model from the following points.
1, Motivation for proposing Wide & Deep
Before this paper, the mainstream models in industrial recommendation were basically LR or a plain DNN (of course there were also FM, tree models, and their variants). Generally speaking, linear models such as LR are better at memorization, while a DNN is better at generalization. Let me first explain what memorization and generalization mean here.
- Memorization: the model directly learns (memorizes) strong co-occurrence patterns in the samples, the classic example being diapers and beer. If people who buy diapers frequently buy beer at the same time, i.e. the co-occurrence frequency of diapers and beer is very high, the model can directly remember this co-occurrence; at recommendation time, if a person buys diapers, we should recommend beer to him.
- Generalization: generalization is the ability to transfer what has been learned from features to unseen cases. For example, suppose the training set consists of leaves and the label is whether the input is a leaf, and the leaves in the samples are basically green serrated leaves. We then hope the model has generalization ability: when a yellowed, rounded leaf is input, the model can still judge that it is a leaf.
Therefore, to combine these two capabilities, the Wide & Deep model was proposed: the wide side is an LR model responsible for memorization, and the deep side is a multi-layer fully connected network responsible for generalization.
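Concretely, the two sides are combined in a single prediction. For the binary label used in the paper (e.g. whether an app is installed), the joint output is

$$
P(Y=1\mid \mathbf{x}) = \sigma\!\left(\mathbf{w}_{wide}^{\top}\,[\mathbf{x},\ \phi(\mathbf{x})] + \mathbf{w}_{deep}^{\top}\, a^{(l_f)} + b\right),
$$

where $\phi(\mathbf{x})$ denotes the cross-product transformations of the raw features fed to the wide side, $a^{(l_f)}$ is the activation of the last hidden layer of the deep network, and $\sigma$ is the sigmoid function.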
2, Wide & Deep model structure
As can be clearly seen from the model structure figure in the paper, the Wide & Deep model consists of two parts, a wide part and a deep part. The wide part is a linear model (LR), and the deep part is a DNN with three hidden layers of sizes [1024, 512, 256]. Although the network structure is straightforward, there are several details worth paying attention to:
- Training method: the wide part and the deep part are combined into a single logistic loss and trained jointly via back-propagation, rather than being trained separately and then ensembled.
- Optimization algorithms: the wide part uses Google's own FTRL optimizer (a separate blog on FTRL will be written later), while the deep part uses AdaGrad (for AdaGrad, see my blog: Optimization methods in deep learning -- Momentum, Nesterov Momentum, AdaGrad, Adadelta, RMSprop, Adam). This raises two questions: (a) how are two different optimizers trained together? (b) why does the LR part use FTRL while the deep part uses AdaGrad? Both questions are discussed in [Some thoughts] at the end of this blog.
- Features: in the Google Play scenario, features are mainly divided into continuous-valued features and categorical features. In the paper, continuous features are normalized into the range [0, 1], while categorical features are embedded and then concatenated directly as the input of the deep network. The wide part mainly takes some cross features as input, for memorization (see the sketch after this list).
- Model training: the model is trained with a warm start, i.e. "a warm-starting system which initializes a new model with the embeddings and the linear model weights from the previous model". This is consistent with the Abacus platform at my company. The benefits are obvious: lifelong learning plus fine-tuning.
- Model update verification: before pushing an updated model online, verify that it behaves normally, to prevent a broken model from being deployed. This has a strong industrial flavor and is an essential part of any industrial system.
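To make the above points concrete, here is a minimal sketch (not the paper's exact pipeline; the feature names, vocabulary sizes, and hyperparameters are made up for illustration) of wiring up such a model with TensorFlow 1.x's canned DNNLinearCombinedClassifier, which is exactly the estimator whose source code is examined in the next section:

```python
import tensorflow as tf

# Wide side: raw id features plus their cross, for memorization.
# "user_id" / "item_id" and all sizes below are hypothetical.
user_id = tf.feature_column.categorical_column_with_hash_bucket("user_id", hash_bucket_size=10000)
item_id = tf.feature_column.categorical_column_with_hash_bucket("item_id", hash_bucket_size=10000)
user_x_item = tf.feature_column.crossed_column([user_id, item_id], hash_bucket_size=100000)
wide_columns = [user_id, item_id, user_x_item]

# Deep side: continuous features (assumed already normalized to [0, 1])
# plus embedded categorical features, concatenated as the DNN input.
age = tf.feature_column.numeric_column("age_normalized")
deep_columns = [
    age,
    tf.feature_column.embedding_column(user_id, dimension=32),
    tf.feature_column.embedding_column(item_id, dimension=32),
]

model = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,
    # FTRL with L1 regularization -> sparse wide weights.
    linear_optimizer=tf.train.FtrlOptimizer(
        learning_rate=0.01, l1_regularization_strength=0.001),
    dnn_feature_columns=deep_columns,
    dnn_optimizer=tf.train.AdagradOptimizer(learning_rate=0.01),
    dnn_hidden_units=[1024, 512, 256],  # same sizes as in the paper
    n_classes=2)

# model.train(input_fn=train_input_fn)  # train_input_fn yields (features, labels)
```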
3, Some thoughts
Question 1: the wide part and the deep part are trained jointly, but the wide part uses the FTRL optimizer while the deep part uses AdaGrad. How is this trained?
For this question, look directly at the official TensorFlow code: https://github.com/tensorflow/tensorflow/blob/r1.11/tensorflow/python/estimator/canned/dnn_linear_combined.py
deep side:
```python
# deep side
with variable_scope.variable_scope(
    dnn_parent_scope,
    values=tuple(six.itervalues(features)),
    partitioner=dnn_partitioner) as scope:
  dnn_absolute_scope = scope.name
  dnn_logit_fn = dnn._dnn_logit_fn_builder(  # pylint: disable=protected-access
      units=head.logits_dimension,
      hidden_units=dnn_hidden_units,
      feature_columns=dnn_feature_columns,
      activation_fn=dnn_activation_fn,
      dropout=dnn_dropout,
      input_layer_partitioner=input_layer_partitioner,
      batch_norm=batch_norm)
  dnn_logits = dnn_logit_fn(features=features, mode=mode)
```
wide side:
```python
# wide side
with variable_scope.variable_scope(
    linear_parent_scope,
    values=tuple(six.itervalues(features)),
    partitioner=input_layer_partitioner) as scope:
  linear_absolute_scope = scope.name
  logit_fn = linear._linear_logit_fn_builder(  # pylint: disable=protected-access
      units=head.logits_dimension,
      feature_columns=linear_feature_columns,
      sparse_combiner=linear_sparse_combiner)
  linear_logits = logit_fn(features=features)
```
The logits of the wide part and the deep part are added directly, and a single logistic (or softmax) loss is computed on the sum:
```python
# loss function
if n_classes == 2:
  head = head_lib._binary_logistic_head_with_sigmoid_cross_entropy_loss(  # pylint: disable=protected-access
      weight_column=weight_column,
      label_vocabulary=label_vocabulary,
      loss_reduction=loss_reduction)
else:
  head = head_lib._multi_class_head_with_softmax_cross_entropy_loss(  # pylint: disable=protected-access
      n_classes,
      weight_column=weight_column,
      label_vocabulary=label_vocabulary,
      loss_reduction=loss_reduction)

# Combine logits and build full model.
if dnn_logits is not None and linear_logits is not None:
  logits = dnn_logits + linear_logits
elif dnn_logits is not None:
  logits = dnn_logits
else:
  logits = linear_logits
```
During back-propagation, the wide side and the deep side are each updated by their own optimizer. The core statement here is `train_op = control_flow_ops.group(*train_ops)`, which groups the two optimizers' update ops into one training op, so that different optimizers can be used to optimize the two sides.
```python
def _train_op_fn(loss):
  """Returns the op to optimize the loss."""
  train_ops = []
  global_step = training_util.get_global_step()
  if dnn_logits is not None:
    train_ops.append(
        dnn_optimizer.minimize(
            loss,
            var_list=ops.get_collection(
                ops.GraphKeys.TRAINABLE_VARIABLES,
                scope=dnn_absolute_scope)))
  if linear_logits is not None:
    train_ops.append(
        linear_optimizer.minimize(
            loss,
            var_list=ops.get_collection(
                ops.GraphKeys.TRAINABLE_VARIABLES,
                scope=linear_absolute_scope)))

  # Core statement, using group function
  train_op = control_flow_ops.group(*train_ops)
  with ops.control_dependencies([train_op]):
    return state_ops.assign_add(global_step, 1).op

return head.create_estimator_spec(
    features=features,
    mode=mode,
    labels=labels,
    train_op_fn=_train_op_fn,
    logits=logits)
```
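The same pattern can be reproduced in a few lines. The following minimal sketch (TF 1.x graph mode, with toy variables invented purely for illustration) minimizes one shared loss with two optimizers, each restricted to its own variable scope, and groups the two update ops into one training op:

```python
import tensorflow as tf

with tf.variable_scope("linear"):
    w_wide = tf.get_variable("w_wide", shape=[10, 1])
with tf.variable_scope("dnn"):
    w_deep = tf.get_variable("w_deep", shape=[10, 1])

x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])

# Logits of the two sides are summed, exactly as in dnn_linear_combined.py.
logits = tf.matmul(x, w_wide) + tf.matmul(x, w_deep)
loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))

# Each optimizer only updates the variables in its own scope.
wide_op = tf.train.FtrlOptimizer(0.01).minimize(
    loss, var_list=tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="linear"))
deep_op = tf.train.AdagradOptimizer(0.01).minimize(
    loss, var_list=tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="dnn"))

# One sess.run(train_op, ...) applies FTRL to the wide weights and AdaGrad
# to the deep weights in the same step.
train_op = tf.group(wide_op, deep_op)
```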
Question 2: why does the wide side use FTRL while the deep side uses AdaGrad?
Personally, I think FTRL is used on the wide side, on the one hand, to produce a sparse solution: as the paper shows, the cross features of the wide part are crosses of two id features, so sparsity can greatly reduce the model size and make online deployment easier. On the other hand, under joint training the wide (linear) part would otherwise converge much faster than the deep part, and presumably Google chose FTRL to alleviate this: both FTRL and AdaGrad accumulate gradients so that the learning rate decays over time, and FTRL additionally combines L1 and L2 regularization, which further slows down the convergence of the wide part.
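For reference (details will be left to the promised FTRL blog), the FTRL-Proximal update from McMahan et al. at step $t$ is roughly

$$
\mathbf{w}_{t+1} = \arg\min_{\mathbf{w}} \left( \mathbf{g}_{1:t}\cdot\mathbf{w} + \frac{1}{2}\sum_{s=1}^{t}\sigma_s\,\lVert\mathbf{w}-\mathbf{w}_s\rVert_2^2 + \lambda_1\lVert\mathbf{w}\rVert_1 \right),
$$

where $\mathbf{g}_{1:t}=\sum_{s=1}^{t}\mathbf{g}_s$ is the accumulated gradient, $\sigma_s$ defines the per-coordinate decaying learning-rate schedule ($\sum_{s\le t}\sigma_s = 1/\eta_t$), and the $\lambda_1$ term is exactly what drives many wide weights to exactly zero, giving the sparse model mentioned above.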
Note: this is only my personal understanding. If any reader understands it better, please leave a comment.
Reference
- Heng-Tze Cheng, et al. Wide & Deep Learning for Recommender Systems. 2016.
- H. Brendan McMahan, et al. Ad Click Prediction: a View from the Trenches. KDD 2013.