How to choose a balanced sample for binary classification?

Date: 2016-07-01 03:58:43

Tags: apache-spark machine-learning apache-spark-mllib

Here is my code, which loads data from Hive and balances the samples:

// Load the subset of data from Hive
val dataList = DataLoader.loadSubTrainTestData(hiveContext.sql(sampleDataHql))

// Split the data into train (70%) and test (30%) sets
val data = dataList.randomSplit(Array(0.7, 0.3), seed = 11L)

// Count records per label in the training split
val sampleCount = data(0).map(rec => (rec.label, 1)).reduceByKey(_ + _)

val positiveSample = data(0).filter(_.label == 1).cache()
val positiveSize = positiveSample.count()

val negativeSample = data(0).filter(_.label == 0).cache()
val negativeSize = negativeSample.count()

// Build the balanced training set: keep all positives and
// downsample the negatives to roughly the same size
val trainData = positiveSample ++
  negativeSample.sample(withReplacement = false, positiveSize.toDouble / negativeSize, System.nanoTime())

// Approximate data sizes, derived from the 70/30 split ratio
val trainDataSize = positiveSize + negativeSize
val testDataSize = trainDataSize * 3.0 / 7.0

I compute trainDataSize and testDataSize to evaluate the model's confidence.
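Note that positiveSize + negativeSize is the size of the 70% split before balancing; after the negatives are downsampled, trainData holds roughly 2 * positiveSize records, so the testDataSize derived from the 3:7 ratio only estimates the untouched test split. A minimal sketch of getting the exact sizes directly, at the cost of one extra Spark job per RDD (not part of the original post):

// Exact sizes via count(); cache trainData first since count() triggers a job
val exactTrainSize = trainData.cache().count() // roughly 2 * positiveSize after balancing
val exactTestSize = data(1).count()            // the untouched 30% test split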

1 Answer:

Answer 0 (score: 2):

OK, I haven't tested this code, but it should look something like this:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val data: RDD[LabeledPoint] = ???

// Desired sampling fraction per label
val fractions: Map[Double, Double] = Map(0.0 -> 0.5, 1.0 -> 0.5)

val sampledData: RDD[LabeledPoint] = data
  .keyBy(_.label)                      // PairRDD keyed by label
  .sampleByKeyExact(false, fractions)  // optionally pass a seed as the third argument
  .values

You simply convert the RDD[LabeledPoint] into a PairRDD keyed by label and then apply sampleByKeyExact with the fractions you want to use.
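Note that fractions of 0.5 for both labels sample half of each class independently, which preserves the original class ratio. To actually balance the classes, the fractions can be derived from the per-label counts; a sketch under that assumption (variable names are illustrative, not from the answer):

// Count records per label
val counts: Map[Double, Long] = data.map(_.label).countByValue().toMap

// Keep the minority class whole and downsample every other class to its size
val minCount = counts.values.min.toDouble
val balancedFractions: Map[Double, Double] =
  counts.map { case (label, n) => label -> (minCount / n) }

val balanced: RDD[LabeledPoint] = data
  .keyBy(_.label)
  .sampleByKeyExact(withReplacement = false, fractions = balancedFractions)
  .values

Unlike the plain sample call in the question, sampleByKeyExact guarantees that each class is sampled to exactly the requested fraction, at the cost of an extra pass over the data.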