Spark job aborted

Asked: 2016-08-27 13:50:25

Tags: python apache-spark

I want to add a prediction column from a logistic regression model to my DataFrame. The function is as follows:

    def add_probability(df, model):
        coefficients_broadcast = sc.broadcast(model.coefficients)
        intercept = model.intercept

        def get_p(features):
            # Compute the raw value
            raw_prediction = coefficients_broadcast.value.dot(features)

            # Bound the raw value between -20 and 20
            if raw_prediction > 20: raw_prediction = 20
            if raw_prediction < -20: raw_prediction = -20
            print raw_prediction

            # Return the probability
            return (1 + exp(-raw_prediction))^(-1)

        get_p_udf = udf(get_p, DoubleType())
        return df.withColumn('p', get_p_udf('features'))

The nested function get_p computes the probability of an observation given its feature vector. After defining the function, I apply it to my training DataFrame:
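Outside of Spark, the same computation can be sketched in plain Python (the function and argument names here are illustrative, not part of the original code):

```python
from math import exp

def get_p_plain(coefficients, features, intercept=0.0):
    """Probability of the positive class under logistic regression."""
    # Raw score: dot product of coefficients and features, plus intercept
    raw = sum(c * f for c, f in zip(coefficients, features)) + intercept
    # Clip to [-20, 20] so exp() cannot overflow
    raw = max(-20.0, min(20.0, raw))
    # Logistic (sigmoid) function
    return 1.0 / (1.0 + exp(-raw))

print(get_p_plain([0.5, -0.25], [2.0, 4.0]))  # raw = 0.0 -> 0.5
```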

add_probability_model_basic = lambda df: add_probability(df, lr_model_basic)
training_predictions = add_probability_model_basic(ohe_train_df).cache()

print training_predictions.first()

However, when I try to view the first row, I get the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 98.0 failed 1 times, most recent failure: Lost task 0.0 in stage 98.0 (TID 270, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last)

If I comment out that last print command, my code seems to generate the training_predictions DataFrame successfully. I'm puzzled: why can't it print the first row?

1 Answer:

Answer 0 (score: 1)

You're absolutely going to love the reason this fails:

return (1+exp(-raw_prediction))^(-1) 

should be

return (1+exp(-raw_prediction))**(-1)

Glad I could help.
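For context: in Python, `^` is bitwise XOR, which is only defined for integers; applying it to a float raises a TypeError, and that exception inside the UDF is what aborts the Spark job. A quick plain-Python illustration:

```python
from math import exp

raw_prediction = 0.0

# ** is exponentiation: this is the logistic function, so p = 0.5 at 0
p = (1 + exp(-raw_prediction)) ** (-1)
print(p)

# ^ is bitwise XOR, defined only for integers; on a float it raises
# a TypeError -- the error that surfaced inside the Python worker
try:
    (1 + exp(-raw_prediction)) ^ (-1)
    error = None
except TypeError as e:
    error = e
print(error)
```

This also explains why commenting out the print "worked": Spark transformations are lazy, so the UDF never actually runs until an action like first() forces evaluation.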