Question

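I have a function get_alerts that returns two String fields. For simplicity, let's assume it has a fixed output: return "xxx", "yyy" (this lets me avoid posting the code of get_alerts).

Then I wrap it in a UDF and use it in a when-otherwise expression on a Spark DataFrame df.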

import pyspark.sql.functions as func
from pyspark.sql.types import StructType, StructField, StringType

get_alerts_udf = func.udf(lambda c1, c2, c3:
       get_alerts(c1, c2, c3),
       StructType(
                    [
                        StructField('probability', StringType()),
                        StructField('level', StringType())
                    ]
       )
    )

df = df \
    .withColumn("val", func.when(func.col("is_inside") == 1, get_alerts_udf(1,2,3))
                            .otherwise(["0","0"])
                )

The problem is that otherwise(["0","0"]) does not match the output type of get_alerts_udf.

How can I define the value passed to otherwise so that it corresponds to:

      StructType(
                    [
                        StructField('probability', StringType()),
                        StructField('level', StringType())
                    ]
       )

Update:

I am still getting this error:

pyspark.sql.utils.AnalysisException: u"cannot resolve 'CASE WHEN (`is_inside` = 1) THEN <lambda>(c1, c2, c3) ELSE named_struct('col1', '0', 'col2', '0') END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;"

Following the suggestion in the duplicate post, I used otherwise(func.struct(func.lit("xxx"),func.lit("yyy"))).

How do I create the correct output in otherwise?

0 answers: