For pandas I have a code snippet like this:
def setUnknownCatValueConditional(df, conditionCol, condition, colToSet, _valueToSet='KEINE'):
    df.loc[(df[conditionCol] == condition) & (df[colToSet].isnull()), colToSet] = _valueToSet
which conditionally replaces values in a data frame.
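For reference, a minimal sketch of how this helper behaves on a toy frame (the column names and data here are illustrative, chosen to echo the Spark example later in the question):

```python
import numpy as np
import pandas as pd

def setUnknownCatValueConditional(df, conditionCol, condition, colToSet, _valueToSet='KEINE'):
    # Set colToSet to _valueToSet wherever conditionCol equals condition
    # AND colToSet is currently null.
    df.loc[(df[conditionCol] == condition) & (df[colToSet].isnull()), colToSet] = _valueToSet

df = pd.DataFrame({"bar": ["first", "noValidFormat"],
                   "foo": ["2016-01-01", np.nan]})
setUnknownCatValueConditional(df, "bar", "noValidFormat", "foo", _valueToSet="noValue")
print(df)  # only the null foo on the matching row is replaced
```

Note the parentheses around each comparison before combining with `&`; they matter, as the answer below shows for the Spark port as well.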
Trying to port this functionality over to Spark,
df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show
did not work for me:
df.withColumn("A", when($"A" === "x" and $"B" isNull, "replacement")).show
warning: there was one feature warning; re-run with -feature for details
org.apache.spark.sql.AnalysisException: cannot resolve '((`A` = 'x') AND `B`)' due to data type mismatch: differing types in '((`A` = 'X') AND `B`)' (boolean and string).;;
even though df.printSchema returns string for both A and B.

What is wrong here?

A minimal example:
import java.sql.{ Date, Timestamp }
case class FooBar(foo: Date, bar: String)

val myDf = Seq(("2016-01-01","first"),("2016-01-02","second"),("2016-wrongFormat","noValidFormat"), ("2016-01-04","lastAssumingSameDate"))
  .toDF("foo","bar")
  .withColumn("foo", 'foo.cast("Date"))
  .as[FooBar]
myDf.printSchema
root
|-- foo: date (nullable = true)
|-- bar: string (nullable = true)
scala> myDf.show
+----------+--------------------+
| foo| bar|
+----------+--------------------+
|2016-01-01| first|
|2016-01-02| second|
| null| noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+
myDf.withColumn("foo", when($"bar" === "noValidFormat" and $"foo" isNull, "noValue")).show
Expected output:
+----------+--------------------+
| foo| bar|
+----------+--------------------+
|2016-01-01| first|
|2016-01-02| second|
| "noValue"| noValidFormat|
|2016-01-04|lastAssumingSameDate|
+----------+--------------------+
In case of chaining conditions:

df
  .withColumn("A",
    when(
      (($"B" === "x") and ($"B" isNull)) or
      (($"B" === "y") and ($"B" isNull)), "replacement"))

should work.
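For comparison, the same chained condition on the pandas side could be a single combined boolean mask. A sketch, under the assumption that the null check is meant to apply to the column being set (the Spark snippet above tests `$"B" isNull`, but B can never both equal "x" and be null, so A is assumed here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan, np.nan, "kept"],
                   "B": ["x", "y", "z"]})

# Replace A where (B == "x" and A is null) OR (B == "y" and A is null).
mask = ((df["B"] == "x") & df["A"].isnull()) | ((df["B"] == "y") & df["A"].isnull())
df.loc[mask, "A"] = "replacement"
print(df)
```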
Answer 0 (score: 3):
Mind the operator precedence. It should be:
myDf.withColumn("foo",
  when(($"bar" === "noValidFormat") and ($"foo" isNull), "noValue"))
This:
$"bar" === "noValidFormat" and $"foo" isNull
is evaluated as:
(($"bar" === "noValidFormat") and $"foo") isNull
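Incidentally, the pandas snippet at the top dodges an analogous trap: in Python, `&` binds more tightly than `==`, which is why each comparison there is wrapped in parentheses. The Scala issue is about how infix and postfix method calls associate rather than `&` precedence, but the symptom is the same. A plain-Python sketch of the Python side:

```python
# & binds tighter than ==, so the right-hand & is evaluated first:
unparenthesized = 1 == 1 & 0   # parsed as 1 == (1 & 0)  ->  1 == 0  ->  False
parenthesized = (1 == 1) & 0   # parsed as True & 0      ->  0

print(unparenthesized, parenthesized)
```

With pandas columns the unparenthesized form typically fails outright (or silently compares the wrong things), so parenthesizing each comparison is the safe habit in both ecosystems.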