I am trying to implement a self-learning approach to train a classifier. I am using Spark 1.6.0. The problem is that when I map one RDD to another I get wrong counts. The same code works fine for small datasets, but for a larger dataset it just goes crazy.
println("INITIAL TRAINING SET SIZE : " + trainingSetInitial.count())
for(counter <- 1 to 10){
println("------------------- This is the_" + counter + " run -----------------")
println("TESTING SET SIZE : " + testing.count())
val lowProbabilitiesSet = testing.flatMap { item =>
if (model.predictProbabilities(item._2)(0) <= 0.75 && model.predictProbabilities(item._2)(1) <= 0.75) {
List(item._1)
} else {
None
}}.cache()
val highProbabilitiesSet = testing.flatMap { item =>
if (model.predictProbabilities(item._2)(0) > 0.75 || model.predictProbabilities(item._2)(1) > 0.75 ) {
List(item._1 +","+ model.predict(item._2).toDouble )
} else {
None
}}.cache()
println("LOW PROBAB SET : " + lowProbabilitiesSet.count())
println("HIGH PROBAB SET : " + highProbabilitiesSet.count())
trainingSetInitial = trainingSetInitial.union(highProbabilitiesSet.map(x => LabeledPoint(List(x)(0).split(",")(8).toString.toDouble, htf.transform(List(x)(0).toString.split(",")(7).split(" ") ))))
model = NaiveBayes.train(trainingSetInitial, lambda = 1.0)
println("NEW TRAINING SET : " + trainingSetInitial.count())
previousCount = lowProbabilitiesSet.count()
testing = lowProbabilitiesSet.map { line =>
val parts = line.split(',')
val text = parts(7).split(' ')
(line, htf.transform(text))
}
testing.checkpoint()
}
Here is the log from a correct run:
INITIAL TRAINING SET SIZE : 238182
------------------- This is the_1 run -----------------
TESTING SET SIZE : 3158722
LOW PROBAB SET : 22996
HIGH PROBAB SET : 3135726
NEW TRAINING SET : 3373908
------------------- This is the_2 run -----------------
TESTING SET SIZE : 22996
LOW PROBAB SET : 566
HIGH PROBAB SET : 22430
NEW TRAINING SET : 3396338
And here is where the problem starts (large dataset input):
INITIAL TRAINING SET SIZE : 31990660
------------------- This is the_1 run -----------------
TESTING SET SIZE : 423173780
LOW PROBAB SET : 62615460
HIGH PROBAB SET : 360558320
NEW TRAINING SET : 395265857
------------------- This is the_2 run -----------------
TESTING SET SIZE : 52673986
LOW PROBAB SET : 51460875
HIGH PROBAB SET : 1213111
NEW TRAINING SET : 401950263
&#39; LOW PROBAB SET&#39;在第一次迭代中应该是&#39; TESTING SET&#39;对于第二次迭代。在某个地方,不知何故,1000万条目消失了。此外,还有“新培训套餐”。第一次迭代应该是“初始培训”的连接。以及“高级证据集”。数字再次匹配。
I get no errors while the code runs. I tried caching each set and unpersisting it at the end of every iteration (HIGH and LOW sets only), but the results are the same. I also tried checkpointing the sets; that didn't work either. Why is this happening?
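For reference, the checkpointing attempt looked roughly like this (a sketch, not the exact code; the checkpoint directory path is illustrative):

// Assuming sc is the active SparkContext; the path is illustrative.
sc.setCheckpointDir("hdfs:///tmp/selftrain-checkpoints")

// checkpoint() is lazy: it only marks the RDD, and the data is written
// out the next time an action (such as count) computes it.
lowProbabilitiesSet.checkpoint()
lowProbabilitiesSet.count()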
EDIT:
Just as a test, I stopped creating a new model inside the loop, to see what happens:
for (counter <- 1 to 5) {
  println("------------------- This is the_" + counter + " run !!! -----------------")
  var updated_trainCnt = temp_train.count()
  var updated_testCnt = test_set.count()
  println("Updated Train SET SIZE: " + updated_trainCnt)
  println("Updated Testing SET SIZE: " + updated_testCnt)

  val highProbabilitiesSet = test_set.filter { item =>
    val output = model.predictProbabilities(item._2)
    output(0) > 0.75 || output(1) > 0.75
  }.map(item => (item._1 + "," + model.predict(item._2), item._2)).cache()

  test_set = test_set.filter { item =>
    val output = model.predictProbabilities(item._2)
    output(0) <= 0.75 && output(1) <= 0.75
  }.map(item => (item._1, item._2)).cache()

  var hiCnt = highProbabilitiesSet.count()
  var lowCnt = test_set.count()
  println("HIGH PROBAB SET : " + hiCnt)
  println("LOW PROBAB SET : " + lowCnt)

  // Sanity check: the two subsets should partition the test set exactly.
  var diff = updated_testCnt - hiCnt - lowCnt
  if (diff != 0) println("ERROR: Test set not correctly split into high/low: " + diff)

  temp_train = temp_train.union(highProbabilitiesSet.map(x =>
    LabeledPoint(x._1.toString.split(",")(8).toDouble, x._2))).cache()
  println("NEW TRAINING SET: " + temp_train.count())

  // model = NaiveBayes.train(temp_train, lambda = 1.0, modelType = "multinomial")
  println("HIGH PROBAB SET : " + highProbabilitiesSet.count())
  println("LOW PROBAB SET : " + test_set.count())
  println("NEW TRAINING SET: " + temp_train.count())
}
With the original model kept fixed, all the numbers come out fine, and even the union of the RDDs works without a problem. But the big question remains: how can training the classification model at the end of each loop corrupt the lowProbabilitiesSet RDD (or the other RDDs), when it never even modifies them? The console logs and the Spark logs show no errors and no executor crashes. How can the classifier training step corrupt my data?
Answer 0 (score: 0)
Even though I still haven't figured out why this happens, as a hack I flushed the RDDs to HDFS and made a bash script that runs the class iteratively, reading the data from HDFS on every run. As far as I can tell, the problem appears when I train the classifier inside the loop.
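In outline, each run ends by writing the remaining low-probability records out, and the next run (a fresh JVM launched by the bash script) reads them back. A minimal sketch, with illustrative paths and assuming the record format from the question:

// End of run N: persist the remaining low-probability records (an RDD of
// comma-separated strings) to HDFS.
lowProbabilitiesSet.saveAsTextFile("hdfs:///selftrain/iter_" + counter)

// Start of run N+1 (a fresh driver started by the bash script): read them
// back and rebuild the (line, features) pairs exactly as in the question.
val testing = sc.textFile("hdfs:///selftrain/iter_" + counter).map { line =>
  val parts = line.split(',')
  (line, htf.transform(parts(7).split(' ')))
}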
Answer 1 (score: -2)
I don't immediately see the problem. Please reduce the code to a minimal example that still shows the actual issue. As a first step, I would suggest rewriting the flatMap operations as filter, from:
val highProbabilitiesSet = testing.flatMap { item =>
  if (model.predictProbabilities(item._2)(0) > 0.75 || model.predictProbabilities(item._2)(1) > 0.75) {
    List(item._1 + "," + model.predict(item._2).toDouble)
  } else {
    None
  }
}.cache()
to:
// Same "key,label" output strings as before, but predictProbabilities is
// evaluated once per record and the intent is easier to follow.
val highProbabilitiesSet = testing.filter { item =>
  val output = model.predictProbabilities(item._2)
  output(0) > 0.75 || output(1) > 0.75
}.map(item => item._1 + "," + model.predict(item._2).toDouble).cache()
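Going one step further, you could score every record exactly once and derive both subsets from the same cached result, so predictProbabilities is never recomputed per subset. A sketch under the same assumptions as the question (testing is an RDD of (key, features) pairs; the name scored is illustrative):

// Score each record once and keep the class probabilities alongside the data.
val scored = testing.map { case (key, features) =>
  (key, features, model.predictProbabilities(features))
}.cache()

// Both subsets are consistent projections of the same scored RDD.
val highProbabilitiesSet = scored
  .filter { case (_, _, p) => p(0) > 0.75 || p(1) > 0.75 }
  .map { case (key, features, _) => key + "," + model.predict(features).toDouble }

val lowProbabilitiesSet = scored
  .filter { case (_, _, p) => p(0) <= 0.75 && p(1) <= 0.75 }
  .map { case (key, _, _) => key }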