pyspark logistic regression probability prediction

Time: 2018-11-03 11:22:43

Tags: pyspark probability logistic-regression prediction

I applied logistic regression in pyspark with the code below:

import numpy
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import BinaryLogisticRegressionSummary, LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator

Then I create the data frame:

df = spark.createDataFrame([(0,0,1),
                          (1,1,0)]
                       , ['label', 'X1', 'X2'])

Apply the R formula:

formula = RFormula(formula="label ~ X1+X2")
output = formula.fit(df).transform(df)
output.show()

Then fit the model:

df_log=output.select([c for c in output.columns if c in 
       {'label','features'}])
final_model=LogisticRegression()
fit_final_model=final_model.fit(df_log)
predictions_and_labels=fit_final_model.evaluate(df_log)
pred=predictions_and_labels.predictions.show(1,truncate=False)

Below is the output (I rounded the decimals):

+-----+---------+---------------+--------------+----------+
|label|features |rawPrediction  |probability   |prediction|
+-----+---------+---------------+--------------+----------+
|0    |[0.0,1.0]|[18.930,-18.93]|[0.99,6.00E-9]|0.0       |
|1    |[1.0,0.0]|[-18.93,18.930]|[6.00E-9,0.99]|1.0       |
+-----+---------+---------------+--------------+----------+
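As a rough check on the numbers in the table above, the sketch below reproduces the probability pair from the raw score with a plain logistic (sigmoid) function. It assumes the usual binary logistic regression convention, in which rawPrediction is [-margin, margin], probability is [P(label=0), P(label=1)], and the prediction is 1 when P(label=1) exceeds a threshold of 0.5. The helper names `binary_probabilities` and `predict` are made up for illustration and are not part of pyspark.

```python
import math


def binary_probabilities(margin):
    """Map the class-1 raw score (margin) to [P(label=0), P(label=1)]."""
    p1 = 1.0 / (1.0 + math.exp(-margin))
    return [1.0 - p1, p1]


def predict(probabilities, threshold=0.5):
    """Return 1.0 when P(label=1) exceeds the threshold, otherwise 0.0."""
    return 1.0 if probabilities[1] > threshold else 0.0


# Class-1 raw scores taken from the second entry of rawPrediction in each row.
row_label_0 = binary_probabilities(-18.93)
row_label_1 = binary_probabilities(18.93)

print(row_label_0, predict(row_label_0))
print(row_label_1, predict(row_label_1))
```

Under that assumption, the array holds one probability per class, indexed by label, and the two entries sum to 1. The second entry drives the prediction only because index 1 corresponds to label 1.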

Now my questions are:

1) Is the prediction assigned based on the second probability value? If so, why? What is this array for?

2) If so, how can I add these probabilities to the data frame as columns?

3) I would also like to add the prediction to the data frame as a column. Below is what I tried, but it gives me an error:

df.withColumn('prediction', pred.prediction)

4) How can I round the values in the output?
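One possible approach to questions 2 to 4 is sketched below; it is not a confirmed answer. It assumes Spark 3.0 or later (for `pyspark.ml.functions.vector_to_array`) and a predictions data frame with the default `probability` and `prediction` columns. The function name `add_probability_columns`, the column names `p_0` and `p_1`, and the `digits` parameter are hypothetical.

```python
def add_probability_columns(predictions, digits=4):
    """Return `predictions` with the probability vector split into rounded columns.

    `predictions` is assumed to be the DataFrame returned by the fitted
    model's transform(...) (or the `predictions` attribute of its summary),
    so it already carries `probability` and `prediction` next to the inputs.
    """
    # Imported here so the sketch can be loaded without a Spark installation.
    from pyspark.ml.functions import vector_to_array  # Spark 3.0+
    from pyspark.sql import functions as F

    # An ML vector column cannot be indexed directly, so convert it to an
    # array<double> column first and then pick the entries by position.
    prob = vector_to_array(F.col("probability"))
    return (
        predictions
        .withColumn("p_0", F.round(prob[0], digits))
        .withColumn("p_1", F.round(prob[1], digits))
    )
```

With the names from the question, this could be called as `add_probability_columns(fit_final_model.transform(df_log)).show()`. The attempt in question 3 most likely fails for two reasons: `show()` returns None, so `pred` holds no data frame, and `withColumn` cannot take a column that belongs to a different data frame. Working from the transformed data frame avoids both problems, since the prediction column is already there.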

Thanks.

0 answers:

No answers yet.