I have a pyspark dataframe with 4 columns:

id | name | age | job
----------------------------

I want to use 3 of the columns (of array<string> type) in a when clause, and keep the value of only one of them. So I used when, but I get an error:
new_df = my_df.select("id", "name", "age", "job").withColumn(
    "coordinate",
    F.when(F.size(F.col("id")) > 0, my_df["id"])
     .when(F.size(F.col("name")) > 0, my_df["name"])
     .when(F.size(F.col("age")) > 0, my_df["age"])
     .otherwise("null"))
A short excerpt of the error:
AnalysisException: u"cannot resolve 'CASE WHEN (size(`id`) > 0) THEN `id` WHEN (size()...... name` WHEN (size() ..... age WHEN (size) ....
ELSE 'null' END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;
How can I fix this? Thanks.
Answer 0 (score: 1)
The type mismatch comes from otherwise("null"): the THEN branches return array<string>, while 'null' is a string literal. What to use instead depends on what you want to fill into the coordinate column when the input arrays are empty or null.

import pyspark.sql.functions as F
df = sqlContext.createDataFrame(
    [(['1', '2'], ['a', 'b'], ['30', '40'], 'it'),
     ([], [], [], 'it')],
    ['id', 'name', 'age', 'job'])
df.withColumn("coordinate",
              F.when(F.size("id") > 0, df["id"])
               .when(F.size("name") > 0, df["name"])
               .when(F.size("age") > 0, df["age"])
               .otherwise(None)).show()
+------+------+--------+---+----------+
| id| name| age|job|coordinate|
+------+------+--------+---+----------+
|[1, 2]|[a, b]|[30, 40]| it| [1, 2]|
| []| []| []| it| null|
+------+------+--------+---+----------+
To fill an array instead of a null:

df.withColumn("coordinate",
              F.when(F.size("id") > 0, df["id"])
               .when(F.size("name") > 0, df["name"])
               .when(F.size("age") > 0, df["age"])
               .otherwise(F.array(F.lit(None)))).show()
+------+------+--------+---+----------+
| id| name| age|job|coordinate|
+------+------+--------+---+----------+
|[1, 2]|[a, b]|[30, 40]| it| [1, 2]|
| []| []| []| it| []|
+------+------+--------+---+----------+