Question

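I am trying to understand how (or whether) the code below works. In particular, I don't understand why this code can be sure to preserve the order of the elements in an RDD after a map. This is essentially an instance of the same question asked here: Mind blown: RDD.zip() method. I don't understand why or how the last line ensures that zip actually pairs each prediction with the corresponding label from the testData RDD. One of the comments mentions that if the RDD, testData in this case, is ordered in some way, then map will preserve that order. However, predictions is a completely different RDD... I can't see how or why this works!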

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = labeledDataRDD.randomSplit([0.7, 0.3])
# Train a RandomForest model
model = RandomForest.trainClassifier(trainingData, numClasses=2510,
                                     categoricalFeaturesInfo={}, numTrees=100,
                                     featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
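
To make the question concrete, here is a plain-Python toy model of what the last line seems to rely on. It is only a sketch, not Spark's implementation: an "RDD" is modelled as a list of partitions, toy_map and toy_zip are hypothetical helpers, and the prediction rule is made up. It assumes that model.predict behaves like an element-by-element map, so that both sides of the zip derive from testData with the same partitioning. The PySpark documentation for RDD.zip states that it assumes both RDDs have the same number of partitions and the same number of elements in each partition (for example, when one was made through a map on the other).

```python
# Toy model only: each "RDD" is a list of partitions, each partition a list.
def toy_map(partitions, f):
    # Apply f element by element; elements never change partition or position.
    return [[f(x) for x in part] for part in partitions]


def toy_zip(left, right):
    # Pair elements partition by partition, position by position.
    if len(left) != len(right):
        raise ValueError("different number of partitions")
    zipped = []
    for left_part, right_part in zip(left, right):
        if len(left_part) != len(right_part):
            raise ValueError("different number of elements in a partition")
        zipped.append(list(zip(left_part, right_part)))
    return zipped


# Hypothetical test data: (label, feature) pairs spread over two partitions.
test_data = [[(0.0, 10), (1.0, 11)], [(1.0, 12)]]

labels = toy_map(test_data, lambda lp: lp[0])
features = toy_map(test_data, lambda lp: lp[1])
# Made-up stand-in for model.predict: predicts 1.0 for odd features.
predictions = toy_map(features, lambda f: float(f % 2))

labels_and_predictions = toy_zip(labels, predictions)
```

In this model every pair lines up with its source row because neither map moves any element. Whether the same guarantee holds for the real predictions RDD is exactly what the question asks.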

Spark transformations and preserving the order of RDD elements

0 answers: