Question

I want to add a prediction column to my DataFrame using a logistic regression model. The function is as follows:

def add_probability(df, model):

    coefficients_broadcast = sc.broadcast(model.coefficients)
    intercept = model.intercept

    def get_p(features):

        # Compute the raw value
        raw_prediction = coefficients_broadcast.value.dot(features)

        # Bound the raw value between 20 and -20
        if raw_prediction>20: raw_prediction=20
        if raw_prediction<-20: raw_prediction=-20
        print raw_prediction

        # Return the probability
        return (1+exp(-raw_prediction))^(-1)

    get_p_udf = udf(get_p, DoubleType())
    return df.withColumn('p', get_p_udf('features'))

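The nested function get_p computes the probability of an observation given its list of features. After defining the function, I apply it to my training DataFrame.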

add_probability_model_basic = lambda df: add_probability(df, lr_model_basic)
training_predictions = add_probability_model_basic(ohe_train_df).cache()

print training_predictions.first()

However, when I try to look at the first row, I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 98.0 failed 1 times, most recent failure: Lost task 0.0 in stage 98.0 (TID 270, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last)

If I comment out the last print command, my code appears to generate the training_predictions DataFrame successfully. I'm puzzled: why can't it print the first row?

Answer 1

You're going to love the reason it fails:

return (1+exp(-raw_prediction))^(-1)

should be

return (1+exp(-raw_prediction))**(-1)
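
To see why this matters: in Python, ^ is bitwise XOR, not exponentiation, and XOR is not defined between a float and an int, so the UDF raises a TypeError on the first row it processes. The error only surfaces at first() because withColumn and cache() are lazy; the UDF body does not run until an action needs a result. Here is a minimal sketch in plain Python, without Spark, that reproduces the difference. The function names sigmoid_xor and sigmoid are made up for this illustration and are not part of the original code.

```python
from math import exp

def sigmoid_xor(raw_prediction):
    # Reproduces the bug: ^ is bitwise XOR, which floats do not support
    return (1 + exp(-raw_prediction)) ^ (-1)

def sigmoid(raw_prediction):
    # ** is Python's power operator, so this computes 1 / (1 + e^-x)
    return (1 + exp(-raw_prediction)) ** (-1)

# Capture the error from the buggy version instead of crashing
try:
    sigmoid_xor(0.0)
    xor_error = None
except TypeError as err:
    xor_error = err

print(xor_error)      # the TypeError raised by float ^ int
print(sigmoid(0.0))   # 0.5
```

Writing the return as 1.0 / (1 + exp(-raw_prediction)) would work as well and avoids the operator confusion entirely.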

Glad I could help.
