Sample sampling and processing

Posted by krabople on Tue, 01 Feb 2022 14:01:31 +0100

1. Background

In recommendation systems, the most common training targets are CTR and CVR. The data for both targets is severely imbalanced: positives are far rarer than negatives. If the samples are not resampled, the model easily learns a bias, which leads to unstable online performance and poor generalization.

2. Sample sampling and processing

2.1 sample purification

  • For various reasons (tracking/instrumentation issues, delayed reporting, etc.), the pulled behavior data may contain small defects. The most common one is that the same record appears in both the positive and the negative samples. This is easy to overlook, and although the impact is small (the affected volume is usually tiny), it is best to clean it up.
  • For such duplicated samples, discard the negative copy and keep the positive one (a minimal dedup sketch is given after this list); if the sample size is large enough, it is also fine to drop both copies.
  • When joining features, different time windows may produce different feature values for the same sample (e.g. a numeric click-count feature); taking the maximum value works well.
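
A minimal sketch of the dedup rule in the second bullet, assuming the same tab-separated sample format and getIndex helper used in the snippets below; the (uid, item) key and the "item" column name are assumptions for illustration only.

val labelIdx = getIndex(sampleFormat, "label")
val uidIdx = getIndex(sampleFormat, "uid")
val itemIdx = getIndex(sampleFormat, "item") // hypothetical column name

val dedupRdd = rdd.map(x => x.split("\t"))
  // assumption: (uid, item) identifies one exposure; adapt the key to your logs
  .map(x => ((x(uidIdx), x(itemIdx)), x))
  // if the same key appears as both positive and negative, keep the positive copy
  .reduceByKey((a, b) => if (a(labelIdx) == "1") a else b)
  .map(_._2)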

2.2 random sampling

This is the most common and direct method. It usually operates on the negative samples, randomly keeping only a fraction of them.

val labelIdx = getIndex(sampleFormat, "label")

val sample = rdd.map(x => {
  val parts = x.split("\t")
  parts
})
val posRdd = sample.filter(x => x(labelIdx) == "1")
val negRdd = sample.filter(x => x(labelIdx) == "0")
val posNum = posRdd.count()
val negNum = negRdd.count()

// keep at most `ratio` negatives per positive; cap the fraction at 1.0
// and force floating-point division so it does not truncate to 0
val targetNegRatio = math.min(negNum.toDouble, posNum * ratio) / negNum

val targetNegRdd = negRdd.sample(false, targetNegRatio, 2021)
val targetSampleRdd = posRdd.union(targetNegRdd).map(x => x.mkString("\t"))
  .sample(false, 1.0, 2021)

2.3 user dimension sampling

User-dimension sampling is finer grained than random sampling: for each user, the ratio of negative to positive samples that are kept is capped at a threshold α, and for users with no positive samples each sample is kept only with probability β. Both thresholds are set empirically; a minimal sketch follows.
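
The original write-up gives no code for this variant; the snippet below is only a minimal sketch under the same tab-separated sample format and getIndex helper as the other snippets, with alpha and beta as assumed placeholder values.

val labelIdx = getIndex(sampleFormat, "label")
val uidIdx = getIndex(sampleFormat, "uid")

// assumed thresholds: alpha caps negatives per positive within a user,
// beta is the keep probability for samples of users without any positive
val alpha = 3.0
val beta = 0.1

val userSampleRdd = rdd.map(x => x.split("\t"))
  .map(x => (x(uidIdx), x))
  .groupByKey()
  .flatMap { case (_, rows) =>
    val (pos, neg) = rows.partition(r => r(labelIdx) == "1")
    val rng = new scala.util.Random(2021)
    if (pos.nonEmpty) {
      // keep all positives and at most alpha * #positives negatives per user
      val maxNeg = (pos.size * alpha).toInt
      pos ++ rng.shuffle(neg.toSeq).take(maxNeg)
    } else {
      // users without positives: keep each sample with probability beta
      neg.filter(_ => rng.nextDouble() < beta)
    }
  }
  .map(x => x.mkString("\t"))

groupByKey is used here for readability; on very large datasets a per-user aggregation (e.g. aggregateByKey with a bounded reservoir) would avoid pulling all of a user's samples into one task.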

2.4 click dimension sampling

Negative samples are kept only for users who have clicks (positive samples); for users without any positive sample, no negative samples are produced.

val labelIdx = getIndex(sampleFormat, "label")
val uidIdx = getIndex(sampleFormat, "uid")

val sample = rdd.map(x => {
  val parts = x.split("\t")
  parts
})

val posRdd = sample.filter(x => x(labelIdx) == "1").map(x => (x(uidIdx), x))
val negRdd = sample.filter(x => x(labelIdx) == "0").map(x => (x(uidIdx), x))

// users that have at least one click (positive sample)
val clickUidRdd = posRdd.keys.distinct().map(uid => (uid, 1))

val _targetPosRdd = posRdd.map(_._2)
// keep negatives only for users that also have a positive sample
val _targetNegRdd = negRdd.join(clickUidRdd).map(_._2._1)

val posNum = _targetPosRdd.count()
val negNum = _targetNegRdd.count()

// cap the negative sampling fraction at 1.0 and avoid integer division
val targetNegRatio = math.min(negNum.toDouble, posNum * ratio) / negNum

val targetNegRdd = _targetNegRdd.sample(false, targetNegRatio, 2021)
val targetSampleRdd = _targetPosRdd.union(targetNegRdd).map(x => x.mkString("\t"))
  .sample(false, 1.0, 2021)

2.5 scene sample reweight

In many cases a single recommendation scene does not have much traffic. If the model is trained only on that scene's data, it may fail to converge and perform poorly.
The common approach is to train the model on data from the whole site, which ensures the model is sufficiently trained. But it also introduces a problem: the distribution of user behavior differs across scenes, so site-wide training can still hurt the target scene. A way to mitigate this is to reweight (oversample) the samples of the target scene.

val labelIdx = getIndex(sampleFormat, "label")
val sceneIdx = getIndex(sampleFormat, "scene")

val sample = rdd.map(x => {
  val parts = x.split("\t")
  parts
})
// oversample the target scene's data with replacement by reweightRatio
val sceneSample = sample.filter(x => x(sceneIdx) == scene)
  .sample(true, reweightRatio, 2021)
val otherSample = sample.filter(x => x(sceneIdx) != scene)

val scenePosSample = sceneSample.filter(x => x(labelIdx) == "1")
val sceneNegSample = sceneSample.filter(x => x(labelIdx) == "0")

val otherPosSample = otherSample.filter(x => x(labelIdx) == "1")
val otherNegSample = otherSample.filter(x => x(labelIdx) == "0")

val scenePosNum = scenePosSample.count()
val sceneNegNum = sceneNegSample.count()
val otherPosNum = otherPosSample.count()
val otherNegNum = otherNegSample.count()

val targetPosNum = scenePosNum + otherPosNum
val targetNegNum = targetPosNum * ratio

val targetPosSample = scenePosSample.union(otherPosSample)
// cap the sampling fraction at 1.0 and avoid integer division
val negFraction = math.min(1.0, targetNegNum.toDouble / (sceneNegNum + otherNegNum))
val targetNegSample = sceneNegSample.union(otherNegSample)
  .sample(false, negFraction, 2021)

val targetSample = targetPosSample.union(targetNegSample)
  .sample(false, 1.0, 2021)
  .map(x => x.mkString("\t"))

2.6 conversion sample reweight

In many recommendation scenarios there are several core business metrics, such as click, conversion, like, and favorite. Click samples are usually by far the most numerous, while samples of the other behaviors are scarce. If you are only training a CTR model, you can consider using the other behaviors to reweight the click samples. This does change the actual behavior distribution of the samples, but subjectively, adding more high-quality samples often helps.

val labelIdx = getIndex(sampleFormat, "label")
val orderIdx = getIndex(sampleFormat, "order")

val sample = rdd.map(x => {
  val parts = x.split("\t")
  parts
})
// oversample converted (ordered) samples with replacement by reweightRatio
val orderSample = sample.filter(x => x(orderIdx) == "1")
  .sample(true, reweightRatio, 2021)
val otherSample = sample.filter(x => x(orderIdx) != "1")

val orderPosSample = orderSample.filter(x => x(labelIdx) == "1")
val orderNegSample = orderSample.filter(x => x(labelIdx) == "0")
val otherPosSample = otherSample.filter(x => x(labelIdx) == "1")
val otherNegSample = otherSample.filter(x => x(labelIdx) == "0")

val posSample = orderPosSample.union(otherPosSample)
val negSample = orderNegSample.union(otherNegSample)

val posNum = posSample.count()
val negNum = negSample.count()

// cap the sampling fraction at 1.0 and avoid integer division
val negFraction = math.min(1.0, posNum * sampleRatio / negNum.toDouble)
val targetSample = negSample.sample(false, negFraction, 2021)
  .union(posSample)
  .sample(false, 1.0, 2021)
  .map(x => x.mkString("\t"))

Topics: Machine Learning AI Data Mining