I want to add a prediction column to my DataFrame using a logistic regression model. The function is as follows:
def add_probability(df, model):
    coefficients_broadcast = sc.broadcast(model.coefficients)
    intercept = model.intercept
    def get_p(features):
        # Compute the raw value
        raw_prediction = coefficients_broadcast.value.dot(features)
        # Bound the raw value between 20 and -20
        if raw_prediction>20: raw_prediction=20
        if raw_prediction<-20: raw_prediction=-20
        print raw_prediction
        # Return the probability
        return (1+exp(-raw_prediction))^(-1)
    get_p_udf = udf(get_p, DoubleType())
    return df.withColumn('p', get_p_udf('features'))
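The nested function get_p computes the probability of an observation given its list of features. After defining the function, I apply it to my training DataFrame: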
add_probability_model_basic = lambda df: add_probability(df, lr_model_basic)
training_predictions = add_probability_model_basic(ohe_train_df).cache()
print training_predictions.first()
However, when I try to look at the first row, I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 98.0 failed 1 times, most recent failure: Lost task 0.0 in stage 98.0 (TID 270, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last)
If I comment out the last print statement, the code appears to produce the training_predictions DataFrame successfully. I'm baffled: why can't it print the first row?
Answer 0 (score: 1)
You are going to love the reason it fails:
return (1+exp(-raw_prediction))^(-1)
should be
return (1+exp(-raw_prediction))**(-1)
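For context: in Python, ^ is bitwise XOR rather than exponentiation, and it is not defined between a float and an int, so the expression raises a TypeError inside the UDF. Because Spark evaluates transformations lazily, the failure would only surface once an action such as first() actually runs the UDF, which is consistent with the code appearing to work when the print is commented out. Below is a minimal sketch in plain Python, without Spark, that contrasts the two operators. The function names sigmoid_xor and sigmoid_pow are made up for illustration and are not part of the original code.

```python
from math import exp

def sigmoid_xor(raw_prediction):
    # Form used in the question: ^ is bitwise XOR,
    # which is undefined for a float and an int
    return (1 + exp(-raw_prediction)) ^ (-1)

def sigmoid_pow(raw_prediction):
    # Corrected form: ** is exponentiation
    return (1 + exp(-raw_prediction)) ** (-1)

# Capture the error raised by the XOR version instead of crashing
try:
    sigmoid_xor(0.0)
    xor_error = None
except TypeError as err:
    xor_error = err

print(xor_error)         # message of the TypeError raised by ^
print(sigmoid_pow(0.0))  # 0.5
```

Writing the return value as 1.0 / (1 + exp(-raw_prediction)) is an equivalent alternative that avoids the power operator altogether.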
Glad I could help.