Spark: conditionally replace values but keep the existing ones

Asked: 2016-11-17 14:43:12

Tags: apache-spark apache-spark-sql spark-dataframe

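I want to fill missing (null) values in Spark conditionally, so that I am sure I have considered every corner case of my data instead of simply filling everything with one replacement value.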

An example could look like this:

case class FooBar(foo:String, bar:String)
val myDf = Seq(("a","first"),("b","second"),("c",null), ("third","fooBar"), ("someMore",null))
         .toDF("foo","bar")
         .as[FooBar]

+--------+------+
|     foo|   bar|
+--------+------+
|       a| first|
|       b|second|
|       c|  null|
|   third|fooBar|
|someMore|  null|
+--------+------+

Unfortunately,

    myDf
        .withColumn(
          "bar",
          when(
            (($"foo" === "c") and ($"bar" isNull)) , "someReplacement" 
          )
        ).show

resets all the other, regular values in the column to null:

+--------+---------------+
|     foo|            bar|
+--------+---------------+
|       a|           null|
|       b|           null|
|       c|someReplacement|
|   third|           null|
|someMore|           null|
+--------+---------------+

myDf
    .withColumn(
      "bar",
      when(
        (($"foo" === "c") and ($"bar" isNull)) or
        (($"foo" === "someMore") and ($"bar" isNull)), "someReplacement" 
      )
    ).show

I would really like to use this form to fill in values for different classes/categories of foo, but it does not work either.

I am curious how to fix this.

1 Answer:

Answer 0: (score: 5)

Use otherwise:

when(
  (($"foo" === "c") and ($"bar" isNull)) or
  (($"foo" === "someMore") and ($"bar" isNull)), "someReplacement" 
).otherwise($"bar")
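
For reference, the otherwise variant can be put into a small self-contained program. This is only a sketch: the object name OtherwiseExample, the helper fillBar and the local SparkSession are assumptions made for illustration, and the last input row is assumed to hold a real null rather than the string "null". A when without otherwise yields null for every row that does not match, which is why the question's version wiped the column; otherwise(col("bar")) hands the original value back.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object OtherwiseExample {
  // Replace bar only where it is null AND foo is one of the chosen categories;
  // otherwise(col("bar")) keeps the original value for every other row.
  def fillBar(df: DataFrame): DataFrame =
    df.withColumn(
      "bar",
      when(
        (col("foo") === "c" || col("foo") === "someMore") && col("bar").isNull,
        "someReplacement"
      ).otherwise(col("bar"))
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("otherwise-example")
      .getOrCreate()
    import spark.implicits._

    // Same data as in the question, with real nulls in the last column.
    val myDf = Seq(("a", "first"), ("b", "second"), ("c", null),
                   ("third", "fooBar"), ("someMore", null))
      .toDF("foo", "bar")

    fillBar(myDf).show()
    spark.stop()
  }
}
```

Rows whose foo is not listed keep whatever bar they had, including null.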

or coalesce:

coalesce(
  $"bar",  
  when(($"foo" === "c") or ($"foo" === "someMore"), "someReplacement")
)

The reason to prefer coalesce is simply less typing: you do not have to repeat $"bar" isNull for every condition.
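
The question mentions filling in values for different categories of foo. One possible way to extend the coalesce idea is to fold a lookup table into a chain of when expressions. This is a sketch under assumptions: the object PerCategoryFill, the helper fillBar and the replacement strings are hypothetical and do not come from the original answer.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit, when}

object PerCategoryFill {
  // Hypothetical lookup table: category of foo -> value to use when bar is null.
  val replacements: Map[String, String] =
    Map("c" -> "replacementForC", "someMore" -> "replacementForSomeMore")

  def fillBar(df: DataFrame, byFoo: Map[String, String]): DataFrame = {
    // Build nested when(...).otherwise(...) expressions, one per category.
    // The innermost fallback is a typed null, so unlisted categories stay null.
    val fallback: Column = byFoo.foldLeft(lit(null).cast("string")) {
      case (acc, (category, replacement)) =>
        when(col("foo") === category, replacement).otherwise(acc)
    }
    // coalesce returns the first non-null argument: the existing bar wins,
    // and the per-category fallback is used only where bar is null.
    df.withColumn("bar", coalesce(col("bar"), fallback))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("per-category-fill")
      .getOrCreate()
    import spark.implicits._

    val myDf = Seq(("a", "first"), ("b", "second"), ("c", null),
                   ("third", "fooBar"), ("someMore", null))
      .toDF("foo", "bar")

    fillBar(myDf, replacements).show()
    spark.stop()
  }
}
```

Because the null check lives in coalesce, each new category only needs one more entry in the map.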