Question

训练了一个RandomForest（Spark 1.6.0）

val numClasses = 4 // 0-2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 9
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 6
val maxBins = 32

val model = RandomForest.trainClassifier(trainRDD, numClasses, 
                                         categoricalFeaturesInfo, numTrees, 
                                         featureSubsetStrategy, impurity, 
                                         maxDepth, maxBins)

输入标签：

labels = labeledRDD.map(lambda lp: lp.label).distinct().collect()
for label in sorted(labels):
    print label

0.0
1.0
2.0

但输出只包含两个类：

metrics = MulticlassMetrics(labelsAndPredictions)
df_confusion = metrics.confusionMatrix()
display_cm(df_confusion)

输出：

83017.0  81.0    0.0
8703.0   2609.0  0.0
10232.0  255.0   0.0

从我在pyspark中加载相同模型并针对其他数据（上述部分）运行时的输出

DenseMatrix([[  1.75280000e+04,   3.26000000e+02],
            [  3.00000000e+00,   1.27400000e+03]])

Answer 1

它变得更好......我使用pearson相关性来确定哪些列没有任何相关性。删除十个最低的相关列，现在我得到了正确的结果：

Test Error = 0.0401823
precision = 0.959818
Recall = 0.959818

ConfusionMatrix([[ 17323.,      0.,    359.],
                 [     0.,   1430.,     92.],
                 [   208.,    170.,   1049.]])

Spark RandomForest分类器numClasses

1 个答案: