I noticed that calculating the model's precision takes almost as long as building the model itself, which doesn't seem right. I have a cluster of six virtual machines. The most expensive part is the first iteration of the "for item in range(numClasses)" loop. What RDD operations are supposed to be happening behind it?
Code:
%pyspark
from pyspark import StorageLevel
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import UserDefinedFunction
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.tree import DecisionTree
from pyspark.mllib.evaluation import MulticlassMetrics
from timeit import default_timer

def decision_tree(train, test, numClasses, CatFeatInf):
    ref = default_timer()
    # convert the DataFrames to RDDs of LabeledPoint (label is the last column)
    training_data = train.rdd.map(lambda row: LabeledPoint(row[-1], row[:-1])).persist(StorageLevel.MEMORY_ONLY)
    testing_data = test.rdd.map(lambda row: LabeledPoint(row[-1], row[:-1])).persist(StorageLevel.MEMORY_ONLY)
    print 'transformed in dense data in: %.3f seconds' % (default_timer() - ref)

    ref = default_timer()
    model = DecisionTree.trainClassifier(training_data,
                                         numClasses=numClasses,
                                         maxDepth=7,
                                         categoricalFeaturesInfo=CatFeatInf,
                                         impurity='entropy',
                                         maxBins=max(CatFeatInf.values()))
    print 'model created in: %.3f seconds' % (default_timer() - ref)

    ref = default_timer()
    # predict/zip are lazy transformations: nothing is actually computed here yet
    predictions_and_labels = model.predict(testing_data.map(lambda r: r.features)) \
                                  .zip(testing_data.map(lambda r: r.label))
    print 'predictions made in: %.3f seconds' % (default_timer() - ref)

    ref = default_timer()
    metrics = MulticlassMetrics(predictions_and_labels)
    res = {}
    for item in range(numClasses):
        try:
            res[item] = metrics.precision(item)
        except:
            res[item] = 0.0
    print 'accuracy calculated in: %.3f seconds' % (default_timer() - ref)
    return res
transformed in dense data in: 0.074 seconds
model created in: 355.276 seconds
predictions made in: 0.095 seconds
accuracy calculated in: 346.497 seconds
Answer (score: 0)
Possibly there were some pending RDD operations that only got executed the first time I called metrics.precision(0).
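That matches how Spark evaluates RDDs: model.predict(...).zip(...) is just a chain of lazy transformations, so no job runs during the "predictions" step; the actual prediction work is triggered the first time MulticlassMetrics needs the data, i.e. on the first metrics.precision(...) call. A minimal sketch of one way to force that job to run in the prediction step instead (using the same variable names as in the function above; this is an illustration, not part of the original code):

    # persist and force evaluation with an action so the predict/zip job runs here;
    # the later MulticlassMetrics calls then reuse the cached results
    predictions_and_labels = model.predict(testing_data.map(lambda r: r.features)) \
                                  .zip(testing_data.map(lambda r: r.label)) \
                                  .persist(StorageLevel.MEMORY_ONLY)
    predictions_and_labels.count()  # count() is an action that materializes the RDD

With this change the "predictions made in" timing should absorb most of the ~346 seconds, and the per-class precision loop should return almost immediately.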