I am working with a PySpark DataFrame. I have some code that tries to convert the DataFrame to an RDD, but I get the following error:
AttributeError: 'SparkSession' object has no attribute 'serializer'
What could be the problem?
from pyspark.ml.classification import NaiveBayes

training, test = rescaledData.randomSplit([0.8, 0.2])

# Train a naive Bayes model.
nb = NaiveBayes(smoothing=1.0, modelType="multinomial")
model = nb.fit(rescaledData)

# Make predictions and test accuracy.
predictionAndLabel = test.rdd.map(lambda p: (model.predict(p.features), p.label))
accuracy = 1.0 * predictionAndLabel.filter(lambda pl: pl[0] == pl[1]).count() / test.count()
print('model accuracy {}'.format(accuracy))
Does anyone have any insight into why the statement test.rdd causes the error? The DataFrame contains Row objects of (label, features).
Thanks
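(For context, a DataFrame of (label, features) rows like rescaledData is commonly produced with a TF-IDF pipeline; the sketch below is only an assumption about how the data might have been prepared, and the input rows and column names are hypothetical rather than taken from the question.)

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.appName('App Name').master("local[*]").getOrCreate()

# Hypothetical raw input: (label, sentence) rows.
raw = spark.createDataFrame(
    [(0.0, "spark converts dataframes to rdds"), (1.0, "naive bayes needs label and features")],
    ["label", "sentence"])

words = Tokenizer(inputCol="sentence", outputCol="words").transform(raw)
tf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1 << 10).transform(words)
idfModel = IDF(inputCol="rawFeatures", outputCol="features").fit(tf)
# rescaledData now holds Row(label, features), matching the question's description.
rescaledData = idfModel.transform(tf).select("label", "features")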
Answer 0 (score: 0)
Apologies as I don't have enough rep to comment. The answer to this question may resolve this, as this pertains to the way the SQL context is initiated:
https://stackoverflow.com/a/54738984/8534357
When I initiated the SparkSession and the SQLContext, I was doing this, which is not right:
from pyspark.sql import SparkSession, SQLContext
sc = SparkSession.builder.appName('App Name').master("local[*]").getOrCreate()
# Not right: SQLContext's first argument expects a SparkContext, not a SparkSession.
sqlContext = SQLContext(sc)
This problem was resolved by doing this instead:
sc = SparkSession.builder.appName('App Name').master("local[*]").getOrCreate()
# Pass the underlying SparkContext explicitly so SQLContext receives a real SparkContext.
sqlContext = SQLContext(sparkContext=sc.sparkContext, sparkSession=sc)
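With the SQLContext built from sc.sparkContext, converting a DataFrame back to an RDD no longer trips over the serializer attribute. A minimal check, where the toy data is an assumption and not part of the original code:

from pyspark.sql import SparkSession, SQLContext

sc = SparkSession.builder.appName('App Name').master("local[*]").getOrCreate()
sqlContext = SQLContext(sparkContext=sc.sparkContext, sparkSession=sc)

# Toy (label, value) DataFrame just to exercise the DataFrame -> RDD path.
df = sqlContext.createDataFrame([(0.0, 1.0), (1.0, 2.0)], ["label", "value"])
print(df.rdd.take(2))  # previously raised: 'SparkSession' object has no attribute 'serializer'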