I want to conditionally fill NaN/null values in Spark (to make sure I handle every corner case of my data, instead of blindly filling everything with a single replacement value).
An example might look like:
case class FooBar(foo: String, bar: String)

import spark.implicits._ // assuming a SparkSession named `spark` is in scope (as in spark-shell)

val myDf = Seq(("a", "first"), ("b", "second"), ("c", null), ("third", "fooBar"), ("someMore", "null"))
  .toDF("foo", "bar")
  .as[FooBar]
+--------+------+
| foo| bar|
+--------+------+
| a| first|
| b|second|
| c| null|
| third|fooBar|
|someMore| null|
+--------+------+
Unfortunately,
myDf
  .withColumn(
    "bar",
    when(
      (($"foo" === "c") and ($"bar" isNull)), "someReplacement"
    )
  ).show
resets all the other, regular values in the column:
+--------+---------------+
| foo| bar|
+--------+---------------+
| a| null|
| b| null|
| c|someReplacement|
| third| null|
|someMore| null|
+--------+---------------+
and
myDf
  .withColumn(
    "bar",
    when(
      (($"foo" === "c") and ($"bar" isNull)) or
      (($"foo" === "someMore") and ($"bar" isNull)), "someReplacement"
    )
  ).show
which I actually want to use to fill in values for the different classes/categories of foo, does not work either.
I am curious how to solve this.
Answer (score: 5):
Use otherwise; without it, when returns null for every row that does not match the condition:
when(
  (($"foo" === "c") and ($"bar" isNull)) or
  (($"foo" === "someMore") and ($"bar" isNull)), "someReplacement"
).otherwise($"bar")
Or coalesce:
coalesce(
  $"bar",
  when(($"foo" === "c") or ($"foo" === "someMore"), "someReplacement")
)
The reason for coalesce is ... less typing (so you don't repeat $"bar" isNull).
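For completeness, a sketch of the coalesce variant applied the same way (again assuming the myDf defined in the question; untested):
import org.apache.spark.sql.functions.{coalesce, when}

myDf
  .withColumn(
    "bar",
    // coalesce returns the first non-null argument:
    // the original bar if present, otherwise the conditional replacement
    coalesce(
      $"bar",
      when(($"foo" === "c") or ($"foo" === "someMore"), "someReplacement")
    )
  )
  .show
In both variants, a row whose bar is null but whose foo matches neither "c" nor "someMore" keeps a null bar.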