'SparkSession' object has no attribute 'serializer' when evaluating a classifier in PySpark

Date: 2019-02-14 04:08:20

Tags: python apache-spark pyspark apache-spark-sql

I'm using Apache Spark in batch mode. I've built a full pipeline that converts text into TF-IDF vectors and then predicts a boolean class with logistic regression:

from pyspark.ml import Pipeline

# Chain the previously created feature transformers, indexers, and regression into a Pipeline
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf,
                            labelIndexer, featureIndexer, lr])
# Fit the full pipeline to the training data
model = pipeline.fit(trainingData)

# Generate predictions for the test data
predictions = model.transform(testData)
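
For context, the stages referenced above were created roughly along these lines; the column names below are placeholder assumptions, not the exact ones from my code:

from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer, VectorIndexer
from pyspark.ml.classification import LogisticRegression

# Text -> tokens -> hashed term frequencies -> TF-IDF vectors
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
idf = IDF(inputCol="rawFeatures", outputCol="tfidfFeatures")

# Index the string label and any categorical features for the classifier
labelIndexer = StringIndexer(inputCol="class", outputCol="label")
featureIndexer = VectorIndexer(inputCol="tfidfFeatures", outputCol="features")

lr = LogisticRegression(featuresCol="features", labelCol="label")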

I can inspect predictions, which is a Spark DataFrame, and it is exactly what I expect. Next, I want to look at a confusion matrix, so I convert the scores and labels to an RDD to pass to BinaryClassificationMetrics():

predictionAndLabels = predictions.select('prediction','label').rdd
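
One note here: BinaryClassificationMetrics is documented to take an RDD of (score, label) pairs, while .rdd on a DataFrame yields Row objects, so an explicit conversion to plain float tuples is a common precaution (a sketch, untested against my exact data):

predictionAndLabels = predictions.select('prediction', 'label') \
    .rdd.map(lambda row: (float(row[0]), float(row[1])))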

Finally, I pass it to BinaryClassificationMetrics:

metrics = BinaryClassificationMetrics(predictionAndLabels)  # this errors out
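
As an aside, since my real goal is a confusion matrix: in PySpark, BinaryClassificationMetrics only exposes areaUnderROC and areaUnderPR, so MulticlassMetrics, which has a confusionMatrix() method, is probably the better fit once this error is resolved. A sketch:

from pyspark.mllib.evaluation import MulticlassMetrics

multi_metrics = MulticlassMetrics(predictionAndLabels)
print(multi_metrics.confusionMatrix().toArray())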

Here is the error:

AttributeError: 'SparkSession' object has no attribute 'serializer'

The error itself isn't helpful, and searching for it turns up a wide range of unrelated issues. The only similar thing I found is this unanswered post: How to resolve error "AttributeError: 'SparkSession' object has no attribute 'serializer'"?

Thanks for your help!

1 Answer:

Answer 0 (score: 1)

For posterity's sake, here's what I did to fix this. When initializing the SparkSession and the SQLContext, I was doing this, which is not right:

from pyspark.sql import SparkSession, SQLContext

sc = SparkSession.builder.appName('App Name').master("local[*]").getOrCreate()
sqlContext = SQLContext(sc)

This problem was resolved by doing this instead:

sc = SparkSession.builder.appName('App Name').master("local[*]").getOrCreate()
sqlContext = SQLContext(sparkContext=sc.sparkContext, sparkSession=sc)

I'm not sure why that needed to be explicit. My guess is that SQLContext's first positional parameter is a SparkContext, so passing the SparkSession positionally leaves downstream code looking up SparkContext attributes like serializer on an object that lacks them, but I'd welcome clarification from the community if someone knows for sure.
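
To illustrate, here's a minimal end-to-end check under the corrected setup (the scores and labels below are made-up toy values):

from pyspark.sql import SparkSession, SQLContext
from pyspark.mllib.evaluation import BinaryClassificationMetrics

spark = SparkSession.builder.appName('App Name').master("local[*]").getOrCreate()
sqlContext = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)

# Toy (score, label) pairs standing in for real model output
scoreAndLabels = spark.sparkContext.parallelize(
    [(0.9, 1.0), (0.8, 1.0), (0.3, 0.0), (0.1, 0.0)])

metrics = BinaryClassificationMetrics(scoreAndLabels)
print(metrics.areaUnderROC)  # runs without the 'serializer' AttributeError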