Question

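I am running into confusing PySpark errors when trying to extract the threshold associated with the highest recall value from the DataFrame returned by recallByThreshold. Interestingly, these errors only occur when the application runs in cluster mode.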

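from pyspark.ml.classification import LogisticRegression

# `data` is an existing DataFrame with 'label', 'features' and 'importance' columns.
training, testing = data.randomSplit([0.7, 0.3], seed=100)
train = training.coalesce(200)
test = testing.coalesce(100)
train.persist()
test.persist()
model = LogisticRegression(labelCol='label',
                           featuresCol='features',
                           weightCol='importance',
                           maxIter=30,
                           regParam=0.3,
                           elasticNetParam=0.2)
trained_model = model.fit(train)
# Pick the threshold of the row with the highest recall (this is the line that fails).
threshold = trained_model.summary.recallByThreshold.rdd.max(key=lambda x: x["recall"])["threshold"]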

The last line raises AttributeError: 'NoneType' object has no attribute 'setCallSite'. Breaking it down further, when I try trained_model.summary.recallByThreshold.rdd on its own, I get a different error: *** AttributeError: 'NoneType' object has no attribute 'sc'.

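This question is related to "Spark (pyspark) having difficulty calling statistics methods on worker node", but in this case I cannot collect the DataFrame at all (it produces the same error). I launched my application from IPython on the master node, so shouldn't the SparkContext be available through the SparkSession (I am using Spark 2.1.0)?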
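
For clarity, here is a minimal sketch of what the failing line is meant to compute, written against the DataFrame API instead of going through .rdd. The helper name best_threshold is hypothetical, and the sketch assumes the input has "threshold" and "recall" columns, as recallByThreshold does. I have not confirmed that this avoids the errors above, since collecting the DataFrame fails in the same way.

```python
def best_threshold(recall_df):
    """Return the threshold of the row with the highest recall.

    recall_df is assumed to be a DataFrame with "threshold" and "recall"
    columns, such as trained_model.summary.recallByThreshold.
    (best_threshold is a hypothetical helper, not part of PySpark.)
    """
    # Sort by recall, highest first, and bring only the top row to the driver.
    top_row = recall_df.orderBy(recall_df["recall"].desc()).first()
    # first() returns None when the DataFrame is empty.
    return None if top_row is None else top_row["threshold"]
```

It would be called as threshold = best_threshold(trained_model.summary.recallByThreshold).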

PySpark DataFrame from model statistics cannot be collected or converted to an RDD

0 answers: