Question

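When I run Spark's RandomForest algorithm, I seem to get different splits in the trees on different runs, even though I use the same seed. Can anyone explain whether I am doing something wrong (likely) or whether the implementation is buggy (unlikely, I think)? Here is my program: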

import org.apache.spark.mllib.tree.RandomForest

// read data into rdd
// convert string rdd to LabeledPoint rdd
// train_LP_RDD is RDD of LabeledPoint
// call random forest
val seed = 123417
val numTrees = 10
val numClasses = 2
val categoricalFeaturesInfo: Map[Int, Int] = Map()
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 8
val maxBins = 10
val rfmodel = RandomForest.trainClassifier(train_LP_RDD, numClasses, categoricalFeaturesInfo,
                        numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed)
println(rfmodel.toDebugString)

The output of this snippet differs between two runs. For example, a diff of the two results shows:

sdiff -bBWs run1.debug run2.debug

If (feature 2 <= 15.96)             |         If (feature 2 <= 16.0)
Else (feature 2 > 15.96)            |         Else (feature 2 > 16.0)
If (feature 2 <= 15.96)             |         If (feature 2 <= 16.0)
Else (feature 2 > 15.96)            |         Else (feature 2 > 16.0)
If (feature 2 <= 33.68)             |         If (feature 2 <= 34.66)
Else (feature 2 > 33.68)            |         Else (feature 2 > 34.66)
If (feature 1 <= 17.0)              |         If (feature 1 <= 16.0)
Else (feature 1 > 17.0)             |         Else (feature 1 > 16.0)

Answer 1

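It is impossible to say without more context (and I don't have enough reputation to comment), but as Shaido suggested, one possible cause is that train_LP_RDD is non-deterministic. For example, if you are doing something like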

val train_LP_RDD = sc.textFile(path).sample(withReplacement = false, fraction = 0.5)

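then you will get a different sample every time train_LP_RDD is evaluated (for example, on each call to trainClassifier), even though you have not redefined the RDD. RDDs are lazy, so an unseeded sample is redrawn whenever the lineage is recomputed.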
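One way to check whether the input is the culprit is to pin the sampling seed and cache the RDD before training twice. The following is only a sketch under assumptions that are not in the question: the input is taken to be a CSV file with the label in the first column, the path comes from a hypothetical command-line argument, and the object name, app name, and sampling seed are made up for illustration. Even with a fixed input, other sources of variation (Spark version, partitioning) may remain, so the comparison is a diagnostic rather than a guarantee.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

object SeedCheck {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("rf-seed-check").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // Hypothetical input: CSV rows of the form "label,f1,f2,...".
    val path = args(0)

    val train_LP_RDD = sc.textFile(path)
      // Passing an explicit seed makes the sample repeatable for a given input.
      .sample(withReplacement = false, fraction = 0.5, seed = 42L)
      .map { line =>
        val cols = line.split(',').map(_.toDouble)
        LabeledPoint(cols.head, Vectors.dense(cols.tail))
      }
      // Caching avoids recomputing the lineage on every action.
      .cache()

    // Same parameters as in the question, including the training seed.
    def train() = RandomForest.trainClassifier(
      train_LP_RDD, 2, Map[Int, Int](), 10, "auto", "gini", 8, 10, 123417)

    // Train twice on the same RDD and compare the printed trees.
    val identical = train().toDebugString == train().toDebugString
    println(s"identical trees: $identical")

    sc.stop()
  }
}
```

If the two debug strings still differ with a seeded, cached input, the variation is coming from somewhere other than the sampling step.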

Spark Random Forest: different results with the same seed
