Question

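I'm trying to run a linear regression in Databricks, using Python (PySpark), on a simple DataFrame with one feature and one label. However, I keep hitting a stage failure. I've gone through many similar questions, but most of them are in Scala or go beyond the scope of what I'm doing here.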

Versions:

Notebook: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11), Python version: 2

Here is what I did:

1. The original DataFrame looks like this:

    df_red = df_extra.select('cca3', 'class', 'device_id').groupby('cca3').pivot('class').count()

    display(df_red)

I want the 'mac' column to be the label and the 'other' column to be my single feature.

2. Drop the column 'cca3' and create the label/features

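    from pyspark.sql.functions import col

    # 'mac' becomes the label; 'other' is the only feature column
    features = ['other']
    lr_data = df_red.drop('cca3').select(col('mac').alias('label'), *features)
    display(lr_data)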

3. Create the vector assembler and remove null values from the DataFrame

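    from pyspark.ml.feature import VectorAssembler

    # Assemble the single feature column into a vector, then keep rows with a non-null label
    assembler = VectorAssembler(inputCols=features, outputCol="features")
    output = assembler.transform(lr_data)
    new_lr_data = output.select("label", "features").where(col('label').isNotNull())
    new_lr_data.show()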

4. Fit the linear regression model:

    from pyspark.ml.regression import LinearRegression

    # Fit the model
    lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
    lrModel = lr.fit(new_lr_data)

    # Print the coefficients and intercept for linear regression
    print("Coefficients: %s" % str(lrModel.coefficients))
    print("Intercept: %s" % str(lrModel.intercept))

    # Summarize the model over the training set and print out some metrics
    trainingSummary = lrModel.summary
    #print("numIterations: %d" % trainingSummary.totalIterations)
    #print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
    #trainingSummary.residuals.show()
    #print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
    #print("r2: %f" % trainingSummary.r2)

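At this point I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 73 in stage 979.0 failed 1 times, most recent failure: Lost task 73.0 in stage 979.0 (TID 32624, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$4: (struct ) => struct ,values:array >)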

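What causes the above error in Databricks? Could it be because I'm using only one feature instead of many (which is the more usual case)?

Any help is greatly appreciated!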

Pyspark/Databricks - org.apache.spark.SparkException: Job aborted due to stage failure:

0 answers