spark udf max over multiple columns; TypeError: float() argument must be a string or a number, not 'Row'

Date: 2021-03-29 16:29:47

Tags: apache-spark pyspark apache-spark-sql user-defined-functions

I am trying to get the maximum value across a list of columns, together with the name of the column holding that maximum, as described in these posts: PySpark: compute row maximum of the subset of columns and add to an exisiting dataframe and how to get the name of column with maximum value in pyspark dataframe. I have looked at many posts and tried several options, but none of them worked.

Among other attempts, I hit TypeError: 'Column' object is not callable (see TypeError: 'Column' object is not callable using WithColumn) and tried passing multiple columns as in Pyspark: Pass multiple columns in UDF.

Columns in the table loaded into the dataframe: Rule_Total_Score: double, Rule_No_Identifier_Score: double

from pyspark.sql import functions as f
from pyspark.sql.types import DoubleType

rules = ['Rule_Total_Score', 'Rule_No_Identifier_Score']
df = spark.sql('select * from table')

@f.udf(DoubleType())
def get_max_row_with_None(*cols):
    return float(max(x for x in cols if x is not None))

# Raises: TypeError: float() argument must be a string or a number, not 'Row'
sdf = df.withColumn("max_rule", get_max_row_with_None(f.struct([df[col] for col in df.columns if col in rules])))

1 Answer:

Answer 0 (score: 1)

The UDF expects the columns as separate arguments, not a single struct column: wrapping them in f.struct hands the UDF one Row object, which is what float() then rejects. If you pass the columns in directly and drop the f.struct, it should work (with a guard added for all-null rows):

@f.udf(DoubleType())
def get_max_row_with_None(*cols):
    # Return null instead of crashing max() when every value in the row is null
    if all(x is None for x in cols):
        return None
    else:
        return float(max(x for x in cols if x is not None))

sdf = df.withColumn(
    "max_rule",
    # Unpack the columns as separate arguments instead of wrapping them in f.struct
    get_max_row_with_None(*[df[col] for col in df.columns if col in rules])
)
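
As a side note, the original question also asks for the name of the column that holds the maximum. Here is a UDF-free sketch of both parts, assuming the same rules list (the max_rule_name column name is made up here): Spark's built-in greatest likewise skips nulls and avoids the serialization overhead of a Python UDF.

from pyspark.sql import functions as f

# greatest() skips nulls and returns null only when every input is null
sdf = df.withColumn("max_rule", f.greatest(*rules))

# Pick the first rule column whose value equals the row maximum
sdf = sdf.withColumn(
    "max_rule_name",
    f.coalesce(*[f.when(f.col(c) == f.col("max_rule"), f.lit(c)) for c in rules]),
)

Ties resolve to whichever column appears first in rules, since coalesce returns the first non-null expression.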