'SparkSession' object has no attribute 'serializer' when evaluating a classifier in PySpark

Date: 2019-02-14 04:08:20

Tags: python apache-spark pyspark apache-spark-sql

I'm using Apache Spark in batch mode. I've built a full pipeline that converts text into TF-IDF vectors and then predicts a boolean class with logistic regression:

from pyspark.ml import Pipeline

# Chain the previously created feature transformers, indexers, and regression into a Pipeline
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf,
                            labelIndexer, featureIndexer, lr])
# Fit the full pipeline to the training data
model = pipeline.fit(trainingData)

# Generate predictions for the test data
predictions = model.transform(testData)
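
For context, the stages referenced above were created roughly along these lines; the column names below are placeholder assumptions, not the exact ones from my code:

from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer, VectorIndexer
from pyspark.ml.classification import LogisticRegression

# Text -> tokens -> hashed term frequencies -> TF-IDF vectors
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures")
idf = IDF(inputCol="rawFeatures", outputCol="tfidfFeatures")

# Index the string label and any categorical features for the classifier
labelIndexer = StringIndexer(inputCol="class", outputCol="label")
featureIndexer = VectorIndexer(inputCol="tfidfFeatures", outputCol="features")

lr = LogisticRegression(featuresCol="features", labelCol="label")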

I can inspect predictions, which is a Spark DataFrame, and it is exactly what I expect. Next, I want to look at a confusion matrix, so I convert the scores and labels to an RDD to pass to BinaryClassificationMetrics():

predictionAndLabels = predictions.select('prediction','label').rdd
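
One note here: BinaryClassificationMetrics is documented to take an RDD of (score, label) pairs, while .rdd on a DataFrame yields Row objects, so an explicit conversion to plain float tuples is a common precaution (a sketch, untested against my exact data):

predictionAndLabels = predictions.select('prediction', 'label') \
    .rdd.map(lambda row: (float(row[0]), float(row[1])))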

Finally, I pass it to BinaryClassificationMetrics:

metrics = BinaryClassificationMetrics(predictionAndLabels)  # this errors out
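
As an aside, since my real goal is a confusion matrix: in PySpark, BinaryClassificationMetrics only exposes areaUnderROC and areaUnderPR, so MulticlassMetrics, which has a confusionMatrix() method, is probably the better fit once this error is resolved. A sketch:

from pyspark.mllib.evaluation import MulticlassMetrics

multi_metrics = MulticlassMetrics(predictionAndLabels)
print(multi_metrics.confusionMatrix().toArray())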

Here is the error:

AttributeError: 'SparkSession' object has no attribute 'serializer'

The error itself isn't helpful, and searching for it turns up a wide range of unrelated issues. The only similar thing I found is this unanswered post: How to resolve error "AttributeError: 'SparkSession' object has no attribute 'serializer'"?

Thanks for your help!

1 Answer:

Answer 0 (score: 1)

For posterity's sake, here's what I did to fix this. When initializing the SparkSession and the SQLContext, I was doing this, which is not right:

from pyspark.sql import SparkSession, SQLContext

sc = SparkSession.builder.appName('App Name').master("local[*]").getOrCreate()
sqlContext = SQLContext(sc)

This problem was resolved by doing this instead:

sc = SparkSession.builder.appName('App Name').master("local[*]").getOrCreate()
sqlContext = SQLContext(sparkContext=sc.sparkContext, sparkSession=sc)

I'm not sure why that needed to be explicit. My guess is that SQLContext's first positional parameter is a SparkContext, so passing the SparkSession positionally leaves downstream code looking up SparkContext attributes like serializer on an object that lacks them, but I'd welcome clarification from the community if someone knows for sure.
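
To illustrate, here's a minimal end-to-end check under the corrected setup (the scores and labels below are made-up toy values):

from pyspark.sql import SparkSession, SQLContext
from pyspark.mllib.evaluation import BinaryClassificationMetrics

spark = SparkSession.builder.appName('App Name').master("local[*]").getOrCreate()
sqlContext = SQLContext(sparkContext=spark.sparkContext, sparkSession=spark)

# Toy (score, label) pairs standing in for real model output
scoreAndLabels = spark.sparkContext.parallelize(
    [(0.9, 1.0), (0.8, 1.0), (0.3, 0.0), (0.1, 0.0)])

metrics = BinaryClassificationMetrics(scoreAndLabels)
print(metrics.areaUnderROC)  # runs without the 'serializer' AttributeError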