Spark DataFrame: replace a value with the value from another row

Time: 2019-06-11 12:26:46

Tags: scala apache-spark apache-spark-sql

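My df has many columns, but my problem concerns only two of them, ID and TYPE.

I want to replace each "unknown" type with the type that corresponds to the same ID. The input looks like this: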

val df = Seq(("id1","unknown"),("id1","type1"),("id1","unknown"),("id2","typeX"),
             ("id2","typeX"),("id2","unknown"),("id5","typeY"),("id2","unknown"))
    .toDF("ID","TYPE")
+---+-------+
| ID|   TYPE|
+---+-------+
|id1|unknown|
|id1|  type1|
|id1|unknown|
|id2|  typeX|
|id2|  typeX|
|id2|unknown|
|id5|  typeY|
|id2|unknown|
+---+-------+

After the replacement, the result should look like this:

+---+-----+
| ID| TYPE|
+---+-----+
|id1|type1|
|id1|type1|
|id1|type1|
|id2|typeX|
|id2|typeX|
|id2|typeX|
|id5|typeY|
|id2|typeX|
+---+-----+

This cannot be hard-coded (with when id1 -> type1), because I have 300,000 IDs that change every week...

Here is what I have already tried:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, udf}

// Keep a type only when the ID has exactly one distinct known type.
def dtypeProcessing(dtypeList: Seq[String]): String = {
  val dtype = dtypeList
    .filter(element => element != "unknown" && element != "")
    .distinct
  dtype.length match {
    case 1 => dtype(0)
    case _ => "Unknown" // no known type, or several conflicting ones
  }
}
val typeProcessingUDF = udf(dtypeProcessing _)

// Collect every TYPE seen for the same ID into one array per row.
val w = Window.partitionBy("ID")
val collected = df.withColumn("TYPE", collect_list("TYPE").over(w))
collected.show(false)
+---+--------------------------------+
|ID |TYPE                            |
+---+--------------------------------+
|id5|[typeY]                         |
|id1|[unknown, type1, unknown]       |
|id1|[unknown, type1, unknown]       |
|id1|[unknown, type1, unknown]       |
|id2|[typeX, typeX, unknown, unknown]|
|id2|[typeX, typeX, unknown, unknown]|
|id2|[typeX, typeX, unknown, unknown]|
|id2|[typeX, typeX, unknown, unknown]|
+---+--------------------------------+

// Reduce each array to a single type.
collected.withColumn("TYPE", typeProcessingUDF(col("TYPE"))).show()
+---+-----+
| ID| TYPE|
+---+-----+
|id5|typeY|
|id1|type1|
|id1|type1|
|id1|type1|
|id2|typeX|
|id2|typeX|
|id2|typeX|
|id2|typeX|
+---+-----+

This works, but given the variety of situations it does not handle every case:

  • if [type1,type2] => return "Unknown"
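The per-ID rule that the UDF above applies can be checked without a Spark session. The following is only a minimal sketch on plain Scala collections, assuming that "unknown" and the empty string both count as missing; the names TypeResolution, resolve and fill are hypothetical and do not come from the question.

```scala
object TypeResolution {
  // Mirrors the filter in the question's UDF: "unknown" and "" count as missing.
  // Returns the single distinct known type, or "Unknown" when there is none
  // or when several different known types conflict.
  def resolve(types: Seq[String]): String = {
    val known = types.filter(t => t != "unknown" && t != "").distinct
    if (known.length == 1) known.head else "Unknown"
  }

  // Applies resolve per ID and writes the resolved type back to every row,
  // preserving the original row order.
  def fill(rows: Seq[(String, String)]): Seq[(String, String)] = {
    val resolvedById: Map[String, String] =
      rows.groupBy(_._1).map { case (id, group) => id -> resolve(group.map(_._2)) }
    rows.map { case (id, _) => (id, resolvedById(id)) }
  }
}
```

Under these assumptions, an ID whose rows carry both type1 and type2 resolves to "Unknown" on all of its rows, which is the [type1,type2] case listed above.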

1 Answer:

Answer 0: (score: -1)

Use "ID" as the window partition, together with the function "first" with null values ignored:

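import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, first, when}
// The $"..." column syntax also needs: import spark.implicits._

val idWindow = Window.partitionBy("ID")
// Turn "unknown" into null so that first(..., ignoreNulls = true) skips it.
val unknownToNull = when($"TYPE" === "unknown", null).otherwise($"TYPE")
val result = df.withColumn("TYPE",
  coalesce(unknownToNull,
    first(unknownToNull, ignoreNulls = true).over(idWindow)
  )
)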

Output:

+---+-----+
|ID |TYPE |
+---+-----+
|id1|type1|
|id1|type1|
|id1|type1|
|id2|typeX|
|id2|typeX|
|id2|typeX|
|id2|typeX|
|id5|typeY|
+---+-----+
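Note that first over a window without an ordering returns whichever non-null type Spark encounters first, so an ID with two different known types would have its "unknown" rows filled with one of them. If such IDs should instead become "Unknown", as in the question's [type1,type2] case, one possible approach is collect_set over the same window. This is only a sketch under that assumption: the names FillUnknownTypes and fillUnknown and the id3/id4 sample rows are hypothetical, and, like the question's UDF, it overwrites every row of an ID with the resolved value.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_set, lit, size, when}

object FillUnknownTypes {
  // Hypothetical helper: resolves TYPE per ID without a UDF.
  def fillUnknown(df: DataFrame): DataFrame = {
    val idWindow = Window.partitionBy("ID")
    // null for "unknown" or empty values; collect_set skips nulls.
    val knownType = when(col("TYPE") =!= "unknown" && col("TYPE") =!= "", col("TYPE"))
    // Distinct known types seen for the same ID.
    val knownTypes = collect_set(knownType).over(idWindow)
    df.withColumn("TYPE",
      // Exactly one known type: use it. None or several: "Unknown".
      when(size(knownTypes) === 1, knownTypes.getItem(0)).otherwise(lit("Unknown")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-unknown-types")
      .getOrCreate()
    import spark.implicits._

    // id3 has two conflicting known types, id4 has none (hypothetical rows).
    val df = Seq(("id1", "unknown"), ("id1", "type1"),
                 ("id3", "typeA"), ("id3", "typeB"), ("id3", "unknown"),
                 ("id4", "unknown")).toDF("ID", "TYPE")

    fillUnknown(df).show()
    spark.stop()
  }
}
```

Staying with built-in functions keeps the logic visible to Spark's optimizer, whereas a UDF is opaque to it; whether that matters for 300,000 IDs would need to be measured.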