How to print the prediction probability in PySpark's LogisticRegressionWithLBFGS

Date: 2015-11-06 06:33:29

Tags: apache-spark machine-learning pyspark apache-spark-mllib logistic-regression

I am using Spark 1.5.1 in pyspark. After I fit a model with:

model = LogisticRegressionWithLBFGS.train(parsedData)

I can print the prediction with:

model.predict(p.features)

Is there also a function to print the probability score along with the prediction?

2 Answers:

Answer 0 (score: 7)

You have to clear the threshold first; this works only for binary classification:

 from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel
 from pyspark.mllib.regression import LabeledPoint

 parsed_data = [LabeledPoint(0.0, [4.6,3.6,1.0,0.2]),
                LabeledPoint(0.0, [5.7,4.4,1.5,0.4]),
                LabeledPoint(1.0, [6.7,3.1,4.4,1.4]),
                LabeledPoint(0.0, [4.8,3.4,1.6,0.2]),
                LabeledPoint(1.0, [4.4,3.2,1.3,0.2])]   

 model = LogisticRegressionWithLBFGS.train(sc.parallelize(parsed_data))
 model.threshold
 # 0.5
 model.predict(parsed_data[2].features)
 # 1

 model.clearThreshold()
 model.predict(parsed_data[2].features)
 # 0.9873840020002339
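With the threshold cleared, predict returns the raw logistic score rather than a 0/1 label. As a rough illustration (not MLlib's internal code), that score for a binary model is the sigmoid of the linear combination of the learned weights and the features; the weights and intercept below are hypothetical stand-ins for model.weights and model.intercept:

```python
import math

def logistic_score(weights, intercept, features):
    # Raw logistic regression score: sigmoid of the linear margin w.x + b
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical values; after training, the real ones would come from
# model.weights and model.intercept.
weights = [0.2, -0.1, 0.8, 0.5]
intercept = -2.0
score = logistic_score(weights, intercept, [6.7, 3.1, 4.4, 1.4])
```

A score above 0.5 corresponds to the label 1 that predict returns when the default threshold is in place.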

Answer 1 (score: 0)

I think the question is about computing the prediction scores for the entire training set. If so, I did the following. Not sure if the post is still active, but this is how I did it:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
import pyspark.mllib.regression as reg

# Get the original training data before it was converted to rows of
# LabeledPoint. Let us assume it is otd (a Spark DataFrame).
# Extract the feature set as an RDD, assuming the label is column 0:
fs = otd.rdd.map(lambda x: x[1:])

# A sample way of creating LabeledPoint rows (labels shifted to start at 0):
parsedData = otd.rdd.map(lambda x: reg.LabeledPoint(int(x[0] - 1), x[1:]))

# Convert otd to a pandas DataFrame to get the row count and labels:
ptd = otd.toPandas()
m = ptd.shape[0]

# Train and get the model
model = LogisticRegressionWithLBFGS.train(parsedData, numClasses=10)

# Predict over the whole feature RDD and collect the results
predict = model.predict(fs)
pr = predict.collect()

correct = ((ptd.label - 1) == pr).sum()
print((correct / m) * 100)

Note that the above is for multiclass classification.
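The accuracy computation at the end of the snippet above can be sketched in plain Python; the two label lists below are hypothetical stand-ins for the collected predictions and the shifted pandas labels (ptd.label - 1):

```python
def accuracy_pct(true_labels, predicted_labels):
    # Percentage of positions where the predicted label matches the true one
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * correct / len(true_labels)

# Hypothetical label vectors for illustration only
true_labels = [0, 1, 2, 1, 0]
predicted = [0, 1, 1, 1, 0]
acc = accuracy_pct(true_labels, predicted)  # 4 of 5 match -> 80.0
```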