Replace empty arrays with null in a Spark DataFrame

Time: 2019-11-18 00:40:20

Tags: apache-spark-sql apache-spark-dataset

Consider the following DataFrame:

+---+----+--------+----+
| c1|  c2|      c3|  c4|
+---+----+--------+----+
|  x|  n1|    [m1]|  []|
|  y|  n3|[m2, m3]|[z3]|
|  x|  n2|      []|  []|
+---+----+--------+----+

I want to replace the empty arrays with null:

+---+----+--------+----+
| c1|  c2|      c3|  c4|
+---+----+--------+----+
|  x|  n1|    [m1]|null|
|  y|  n3|[m2, m3]|[z3]|
|  x|  n2|    null|null|
+---+----+--------+----+

What is an efficient way to achieve this?

1 answer:

Answer 0 (score: 1)

You can check the array length with size and return null for empty arrays using the when...otherwise functions:

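// Assumes a SparkSession named `spark`, as provided by spark-shell.
import org.apache.spark.sql.functions.{lit, size, when}
import spark.implicits._

// Sample data; the empty sequences are typed explicitly as Seq[String].
val df = Seq(
        ("x", "n1", Seq("m1"), Seq.empty[String]),
        ("y", "n3", Seq("m2", "m3"), Seq("z3")),
        ("x", "n2", Seq.empty[String], Seq.empty[String])
    ).toDF("c1", "c2", "c3", "c4")
df.show

// Keep each array only if it has at least one element; otherwise emit null.
df.select($"c1", $"c2",
    when(size($"c3") > 0, $"c3").otherwise(lit(null)) as "c3",
    when(size($"c4") > 0, $"c4").otherwise(lit(null)) as "c4"
).show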

It returns:

df: org.apache.spark.sql.DataFrame = [c1: string, c2: string ... 2 more fields]
+---+---+--------+----+
| c1| c2|      c3|  c4|
+---+---+--------+----+
|  x| n1|    [m1]|  []|
|  y| n3|[m2, m3]|[z3]|
|  x| n2|      []|  []|
+---+---+--------+----+
+---+---+--------+----+
| c1| c2|      c3|  c4|
+---+---+--------+----+
|  x| n1|    [m1]|null|
|  y| n3|[m2, m3]|[z3]|
|  x| n2|    null|null|
+---+---+--------+----+
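
If the DataFrame has many array columns, listing each one by hand becomes tedious. The following is a minimal sketch of one way to generalize the same size check to every array column by inspecting the schema. The object name EmptyArraysToNull and the helper emptyArraysToNull are hypothetical names introduced for illustration, and the local SparkSession in main is an assumption. The sketch relies on when(...) without otherwise(...) yielding null for rows that do not match, and it assumes column names without dots or other characters that would need backtick quoting.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, size, when}
import org.apache.spark.sql.types.ArrayType

object EmptyArraysToNull {
  // Hypothetical helper: turns empty arrays into null in every ArrayType column
  // and leaves all other columns untouched.
  def emptyArraysToNull(df: DataFrame): DataFrame = {
    val projected = df.schema.fields.map { field =>
      field.dataType match {
        // when(...) with no otherwise(...) evaluates to null when the condition fails
        case _: ArrayType =>
          when(size(col(field.name)) > 0, col(field.name)).as(field.name)
        case _ =>
          col(field.name)
      }
    }
    df.select(projected: _*)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used only to make the sketch self-contained.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("empty-arrays-to-null")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("x", "n1", Seq("m1"), Seq.empty[String]),
      ("y", "n3", Seq("m2", "m3"), Seq("z3")),
      ("x", "n2", Seq.empty[String], Seq.empty[String])
    ).toDF("c1", "c2", "c3", "c4")

    emptyArraysToNull(df).show()
    spark.stop()
  }
}
```

Because the projection is built from the schema, the column order and names are preserved, and arrays that are already null stay null.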