I have a pyspark dataframe with 4 columns:

id | name | age | job
----------------------------

I want to use 3 of the columns (of array<string> type) in a when clause, and keep the value of only one of them. So I used when, but I get an error:
new_df = my_df.select("id", "name", "age", "job").withColumn(
    "coordinate",
    F.when(F.size(F.col("id")) > 0, my_df["id"])
     .when(F.size(F.col("name")) > 0, my_df["name"])
     .when(F.size(F.col("age")) > 0, my_df["age"])
     .otherwise("null"))
A short excerpt of the error:
AnalysisException: u"cannot resolve 'CASE WHEN (size(`id`) > 0) THEN `id` WHEN (size()...... name` WHEN (size() ..... age WHEN (size) ....
ELSE 'null' END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;
How can I fix this? Thanks.
Answer 0 (score: 1)
The type mismatch comes from otherwise("null"): the THEN branches return array<string>, while 'null' is a string literal. What to use instead depends on what you want to fill into the coordinate column when the input arrays are empty or null.

import pyspark.sql.functions as F
df = sqlContext.createDataFrame(
    [(['1', '2'], ['a', 'b'], ['30', '40'], 'it'),
     ([], [], [], 'it')],
    ['id', 'name', 'age', 'job'])
df.withColumn("coordinate",
              F.when(F.size("id") > 0, df["id"])
               .when(F.size("name") > 0, df["name"])
               .when(F.size("age") > 0, df["age"])
               .otherwise(None)).show()
+------+------+--------+---+----------+
| id| name| age|job|coordinate|
+------+------+--------+---+----------+
|[1, 2]|[a, b]|[30, 40]| it| [1, 2]|
| []| []| []| it| null|
+------+------+--------+---+----------+
To fill an array instead of a null:

df.withColumn("coordinate",
              F.when(F.size("id") > 0, df["id"])
               .when(F.size("name") > 0, df["name"])
               .when(F.size("age") > 0, df["age"])
               .otherwise(F.array(F.lit(None)))).show()
+------+------+--------+---+----------+
| id| name| age|job|coordinate|
+------+------+--------+---+----------+
|[1, 2]|[a, b]|[30, 40]| it| [1, 2]|
| []| []| []| it| []|
+------+------+--------+---+----------+