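Incorrect evaluation of random forest in PySpark

Time: 2017-11-09 04:12:21

Tags: pyspark random-forest pyspark-sql

I am using logistic regression and random forest to make predictions on a telecom churn dataset.

Here is the code from my notebook: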

data = spark.read.csv(r"D:\Shashank\CBA\Pyspark\Telecom_Churn_Data_SingTel.csv", header=True, inferSchema=True)
data.show(3)

(Link: a high-level view of the kind of data I am dealing with.)

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorAssembler

# Drop identifier-like columns that are not used as features
data = data.drop("State").drop("Area Code").drop("Phone Number")

# Index the categorical columns; the indexed Churn column becomes the label
intlPlanIndex = StringIndexer(inputCol="International Plan", outputCol="International Plan Index")
voiceMailPlanIndex = StringIndexer(inputCol="Voice mail Plan", outputCol="Voice mail Plan Index")
churnIndex = StringIndexer(inputCol="Churn", outputCol="label")

# Assemble the indexed and numeric columns into a single feature vector
othercols=["Account Length", "Num of Voice mail Messages","Total Day Minutes", "Total Day Calls", "Total day Charge","Total Eve Minutes","Total Eve Calls","Total Eve Charge","Total Night Minutes","Total Night Calls ","Total Night Charge","Total International Minutes","Total Intl  Calls","Total Intl Charge","Number Customer Service calls "]
assembler = VectorAssembler(inputCols=["International Plan Index", "Voice mail Plan Index"] + othercols, outputCol="features")

(train, test) = data.randomSplit([0.8, 0.2])

# Fit the whole pipeline on the training split
lrObj = LogisticRegression(labelCol="label", featuresCol="features")
pipeline = Pipeline(stages=[intlPlanIndex, voiceMailPlanIndex, churnIndex, assembler, lrObj])
lrModel = pipeline.fit(train)

# Note: this scores the model on the training split, not on test
prediction_train = lrModel.transform(train)

# No metricName is given, so the evaluator's default metric is used
lr_Evaluator = MulticlassClassificationEvaluator()
lr_Evaluator.evaluate(prediction_train)

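(Image: the evaluation result using logistic regression.)

I then repeated the same steps with a random forest classification model, and the evaluation came out at 94.4%. My result looks like this: (link to my random forest evaluation result)

So far everything looked fine. But I was curious to see the actual predictions, so I printed the predicted values with the code below: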

selected = prediction_1.select("features", "Label", "Churn", "prediction")
for row in selected.collect():
    print(row)

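The output looks like the screenshot below: (link to an image showing the two results printed out for manual analysis)

I then copied the two cells shown in the link above into a comparison (diff) tool to see whether the predicted values differ. (I expected some difference, because the evaluation result for random forest is better.)

But the comparison, in every tool I tried, shows that the predictions are identical. The evaluation, however, reports 83.6% for LogisticRegression and 94.4% for RandomForest.

Why is there no difference between the two sets of predictions generated by two different models, when the final evaluation with MulticlassClassificationEvaluator gives different scores?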

2 answers:

Answer 0: (score: 0)

You seem to be interested in metricName="accuracy":

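from pyspark.ml.evaluation import MulticlassClassificationEvaluator

predictions = model.transform(test)
# labelCol must match the label column of your data ("label" in the question's pipeline)
evaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)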

See the official documentation for details.
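For context: when no metricName is passed, as in the question's `MulticlassClassificationEvaluator()`, the evaluator reports its default metric, which is a support-weighted F1 score rather than accuracy. The plain-Python sketch below is not Spark code; it only illustrates how the two numbers can differ on an imbalanced label set. The function names and the toy labels are made up for illustration.

```python
from collections import Counter


def accuracy(labels, preds):
    """Fraction of rows where the prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)


def weighted_f1(labels, preds):
    """Per-class F1, averaged with each class weighted by its support."""
    support = Counter(labels)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for y, p in zip(labels, preds) if y == cls and p == cls)
        fp = sum(1 for y, p in zip(labels, preds) if y != cls and p == cls)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n / len(labels)
    return total


# Hypothetical toy data: 8 non-churners, 2 churners,
# and a model that always predicts "no churn".
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
preds = [0] * 10

print("accuracy:", accuracy(labels, preds))
print("weighted F1:", round(weighted_f1(labels, preds), 4))
```

On this toy data the two metrics disagree, so it is worth stating which metric a reported percentage refers to before comparing models.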

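Answer 1: (score: 0)

This question no longer applies: I can now see differences between the predictions, and they are consistent with the score reported for each model. The problem arose because the data I had copied from the Jupyter notebook was incomplete.

Thank you, I appreciate your time.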
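One way to avoid depending on copied notebook output is to count the disagreements programmatically. This is a minimal sketch, assuming the predictions of both models have already been collected into plain Python lists in the same row order; the function name, the variable names, and the sample values are hypothetical.

```python
def count_disagreements(preds_a, preds_b):
    """Return (number of differing rows, indices of the differing rows)."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    differing = [i for i, (a, b) in enumerate(zip(preds_a, preds_b)) if a != b]
    return len(differing), differing


# Hypothetical values standing in for lists built from each model's
# "prediction" column, for example via select("prediction").collect().
lr_preds = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
rf_preds = [0.0, 1.0, 1.0, 0.0, 0.0, 0.0]

n_diff, diff_rows = count_disagreements(lr_preds, rf_preds)
print(n_diff, diff_rows)
```

A count of zero across the full set of rows would mean the two models really do agree everywhere; a truncated copy of the output cannot show that.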