I applied logistic regression in PySpark using the code below:
import numpy
from pyspark.ml.feature import RFormula
from pyspark.ml.classification import BinaryLogisticRegressionSummary, LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
Then I create the data frame:
df = spark.createDataFrame([(0, 0, 1),
                            (1, 1, 0)],
                           ['label', 'X1', 'X2'])
Apply the R formula:
formula = RFormula(formula="label ~ X1+X2")
output = formula.fit(df).transform(df)
output.show()
Then fit the model:
df_log=output.select([c for c in output.columns if c in {'label','features'}])
final_model=LogisticRegression()
fit_final_model=final_model.fit(df_log)
predictions_and_labels=fit_final_model.evaluate(df_log)
pred=predictions_and_labels.predictions.show(1,truncate=False)
Below is the output (I rounded the decimals):
+-----+---------+---------------+--------------+----------+
|label|features |rawPrediction  |probability   |prediction|
+-----+---------+---------------+--------------+----------+
|0    |[0.0,1.0]|[18.930,-18.93]|[0.99,6.00E-9]|0.0       |
|1    |[1.0,0.0]|[-18.93,18.930]|[6.00E-9,0.99]|1.0       |
+-----+---------+---------------+--------------+----------+
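As a sanity check on the table above, here is a minimal sketch in plain Python (not Spark) that recomputes the probability pair from the rawPrediction pair of the first row. It rests on two assumptions that the output itself does not state: that each probability entry is the logistic function of the matching rawPrediction entry, and that entry i of the array belongs to class i. The helper name `logistic` is only for illustration.

```python
import math


def logistic(margin: float) -> float:
    """Standard logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-margin))


# rawPrediction of the first row in the table (already rounded).
raw_prediction = [18.93, -18.93]

# Assumption: each probability entry is the logistic of the matching raw entry.
probability = [logistic(m) for m in raw_prediction]

# Assumption: the predicted label is the index of the largest probability.
prediction = float(probability.index(max(probability)))

print(probability)
print(prediction)
```

Under these assumptions the second entry comes out at about 6.0E-9 and the first at just under 1, which matches the rounded values in the table.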
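Now my questions are:
1) Is the prediction assigned based on the second probability value? If so, why? What is this array for?
2) If so, how can I add these probabilities as columns to the data frame?
3) I would also like to add the prediction as a column to the data frame. Below is what I tried, but it gives me an error: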
df.withColumn('prediction', pred.prediction)
4) How can I round the values in the output?
Thanks.