Spark: conditionally replacing values

Time: 2016-11-15 12:38:54

Tags: apache-spark apache-spark-sql spark-dataframe

For pandas, I have a code snippet like this:

def setUnknownCatValueConditional(df, conditionCol, condition, colToSet, _valueToSet='KEINE'):
    df.loc[(df[conditionCol] == condition) & (df[colToSet].isnull()), colToSet] = _valueToSet

It conditionally replaces values in a data frame.

I tried to port this functionality to Spark, but

df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show

does not work for me:

df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show
warning: there was one feature warning; re-run with -feature for details
org.apache.spark.sql.AnalysisException: cannot resolve '((`A` = 'x') AND `B`)' due to data type mismatch: differing types in '((`A` = 'X') AND `B`)' (boolean and string).;;

even though df.printSchema reports string for both A and B.

What is wrong here?

Edit

A minimal example:

import java.sql.{ Date, Timestamp }
case class FooBar(foo:Date, bar:String)
val myDf = Seq(("2016-01-01","first"),("2016-01-02","second"),("2016-wrongFormat","noValidFormat"), ("2016-01-04","lastAssumingSameDate"))
         .toDF("foo","bar")
         .withColumn("foo", 'foo.cast("Date"))
         .as[FooBar]

myDf.printSchema
root
 |-- foo: date (nullable = true)
 |-- bar: string (nullable = true)


scala> myDf.show
+----------+--------------------+
|       foo|                 bar|
+----------+--------------------+
|2016-01-01|               first|
|2016-01-02|              second|
|      null|       noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+

myDf.withColumn("foo", when($"bar" === "noValidFormat" and $"foo" isNull, "noValue")).show

Expected output

+----------+--------------------+
|       foo|                 bar|
+----------+--------------------+
|2016-01-01|               first|
|2016-01-02|              second|
| "noValue"|       noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+

EDIT2

If the conditions need to be chained,

df
    .withColumn("A",
      when(
        (($"B" === "x") and ($"B" isNull)) or
        (($"B" === "y") and ($"B" isNull)), "replacement"))

should work.

1 answer:

Answer 0 (score: 3)

Mind the operator precedence. It should be:

myDf.withColumn("foo",
  when(($"bar" === "noValidFormat") and ($"foo" isNull), "noValue"))
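Note that `when` without `otherwise` yields null for every row that does not match the condition, and that `foo` is a date column while the replacement is a string. A minimal sketch of one way to get closer to the expected output in the question, assuming a local SparkSession and a hand-built stand-in for `myDf`; the object name `ConditionalReplace` and the helper `fillMissingFoo` are hypothetical names chosen for illustration:

```scala
import java.sql.Date
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.when

object ConditionalReplace {
  // Hypothetical helper: writes "noValue" where bar matches and foo is null,
  // and keeps the existing foo (rendered as a string) everywhere else.
  def fillMissingFoo(df: DataFrame): DataFrame = {
    val foo = df("foo")
    val bar = df("bar")
    df.withColumn("foo",
      // Calling .isNull with a dot avoids the postfix precedence problem.
      when((bar === "noValidFormat") and foo.isNull, "noValue")
        .otherwise(foo.cast("string")))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in spark-shell a session already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("conditional-replace")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the question's myDf, with the unparseable date as None.
    val myDf = Seq[(Option[Date], String)](
      (Some(Date.valueOf("2016-01-01")), "first"),
      (Some(Date.valueOf("2016-01-02")), "second"),
      (None, "noValidFormat"),
      (Some(Date.valueOf("2016-01-04")), "lastAssumingSameDate")
    ).toDF("foo", "bar")

    fillMissingFoo(myDf).show()
    spark.stop()
  }
}
```

The resulting `foo` column is a string column, since a single column cannot hold both dates and the marker text.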

The original expression:

$"bar" === "noValidFormat" and $"foo" isNull

is evaluated as:

(($"bar" === "noValidFormat") and $"foo") isNull
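The grouping above follows from Scala's parsing rules: `===` starts with `=` and so binds tighter than the alphanumeric `and`, while a postfix operator such as `isNull` applies to the entire infix expression before it. This can be checked without Spark. The sketch below uses a hypothetical `Expr` class as a stand-in for Spark's `Column`; it only records how each expression was grouped. The `postfixOps` import enables the postfix syntax, which is probably what the feature warning in the question refers to.

```scala
import scala.language.postfixOps

// Hypothetical stand-in for Spark's Column: every operator wraps its
// operands in parentheses so the resulting text shows the grouping.
final case class Expr(text: String) {
  def ===(other: Any): Expr = Expr(s"($text === $other)")
  def and(other: Expr): Expr = Expr(s"($text and ${other.text})")
  def isNull: Expr = Expr(s"($text isNull)")
}

object PrecedenceDemo {
  val bar = Expr("bar")
  val foo = Expr("foo")

  // No parentheses: the postfix isNull swallows the whole infix expression.
  val unparenthesized: Expr = bar === "noValidFormat" and foo isNull

  // Explicit parentheses, as in the corrected call.
  val parenthesized: Expr = (bar === "noValidFormat") and (foo isNull)

  def main(args: Array[String]): Unit = {
    println(unparenthesized.text) // (((bar === noValidFormat) and foo) isNull)
    println(parenthesized.text)   // ((bar === noValidFormat) and (foo isNull))
  }
}
```

In the unparenthesized form the `and` receives the string column `foo` as its right operand, which matches the "boolean and string" type mismatch in the error message.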