My df has many columns, but my problem concerns only two of them:
val df = Seq(("id1","unknown"),("id1","type1"),("id1","unknown"),("id2","typeX"),
    ("id2","typeX"),("id2","unknown"),("id5","typeY"),("id2","unknown"))
  .toDF("ID","TYPE")
+---+-------+
| ID|   TYPE|
+---+-------+
|id1|unknown|
|id1|  type1|
|id1|unknown|
|id2|  typeX|
|id2|  typeX|
|id2|unknown|
|id5|  typeY|
|id2|unknown|
+---+-------+
I want to replace the "unknown" type with the type that corresponds to the same ID. The result should look like this:
+---+-----+
| ID| TYPE|
+---+-----+
|id1|type1|
|id1|type1|
|id1|type1|
|id2|typeX|
|id2|typeX|
|id2|typeX|
|id5|typeY|
|id2|typeX|
+---+-----+
This cannot be hard-coded (with rules such as when id1 -> type1), because I have 300,000 IDs that change every week...
Here is what I have already tried:
val w = Window.partitionBy("ID")
df.withColumn("TYPE",collect_list("TYPE").over(w))
+---+--------------------------------+
|ID |TYPE                            |
+---+--------------------------------+
|id5|[typeY]                         |
|id1|[unknown, type1, unknown]       |
|id1|[unknown, type1, unknown]       |
|id1|[unknown, type1, unknown]       |
|id2|[typeX, typeX, unknown, unknown]|
|id2|[typeX, typeX, unknown, unknown]|
|id2|[typeX, typeX, unknown, unknown]|
|id2|[typeX, typeX, unknown, unknown]|
+---+--------------------------------+
def dtypeProcessing(dtypeList : mutable.WrappedArray[String]) : String = {
  val dtype = dtypeList
    .filter(element => element != "unknown" && element != "")
    .distinct
  dtype.length match {
    case 0 => "Unknown"
    case x if x > 1 => "Unknown"
    case x if x == 1 => dtype(0)
  }
}
val typeProcessingUDF = udf(dtypeProcessing _)
df.withColumn("TYPE",typeProcessingUDF(col("TYPE")))
+---+-----+
| ID| TYPE|
+---+-----+
|id5|typeY|
|id1|type1|
|id1|type1|
|id1|type1|
|id2|typeX|
|id2|typeX|
|id2|typeX|
|id2|typeX|
+---+-----+
This works, but it does not handle every case, for example:
if [type1,type2] => return "Unknown"
Answer 0 (score: -1)
Partition a window by "ID" and use the function "first" with ignoreNulls enabled:
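import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named spark is in scope

val idWindow = Window.partitionBy("ID")
// Turn "unknown" into null so that first(..., ignoreNulls = true) skips it.
val unknownToNull = when($"TYPE" === "unknown", null).otherwise($"TYPE")
// Keep a known TYPE as is; otherwise take the first non-null TYPE for the same ID.
val result = df.withColumn("TYPE",
  coalesce(unknownToNull,
    first(unknownToNull, ignoreNulls = true).over(idWindow)
  )
)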
Output:
+---+-----+
|ID |TYPE |
+---+-----+
|id1|type1|
|id1|type1|
|id1|type1|
|id2|typeX|
|id2|typeX|
|id2|typeX|
|id2|typeX|
|id5|typeY|
+---+-----+
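Note that "first" picks whichever known type it happens to see for an ID, so an ID with two different known types is resolved silently. If such IDs should fall back to "Unknown" instead, as the UDF in the question does, one possible variant is to collect the distinct known types per ID and accept them only when there is exactly one. The following is a minimal sketch under assumptions, not a tested production solution: the object name ResolveUnknownTypes, the helper resolve, and the conflicting ID "id3" are hypothetical and only there for illustration. Like the UDF, it rewrites every row of a conflicting ID, including rows that had a known type.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Hypothetical helper object, named here for illustration only.
object ResolveUnknownTypes {

  // Replaces TYPE with the single known type of its ID, or "Unknown"
  // when the ID has no known type or more than one distinct known type.
  def resolve(df: DataFrame): DataFrame = {
    val idWindow = Window.partitionBy("ID")

    // "unknown" and "" become null (when without otherwise yields null).
    val known = when(col("TYPE") =!= "unknown" && col("TYPE") =!= "", col("TYPE"))

    // collect_set skips nulls and removes duplicates within each ID.
    val knownTypes = collect_set(known).over(idWindow)

    df.withColumn("TYPE",
      when(size(knownTypes) === 1, knownTypes.getItem(0))
        .otherwise(lit("Unknown")))
  }

  def main(args: Array[String]): Unit = {
    // Local session for a quick experiment.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("resolve-unknown-types")
      .getOrCreate()
    import spark.implicits._

    // "id3" is a made-up ID with two conflicting known types.
    val df = Seq(
      ("id1", "unknown"), ("id1", "type1"),
      ("id2", "typeX"), ("id2", "unknown"),
      ("id3", "typeA"), ("id3", "typeB"), ("id3", "unknown")
    ).toDF("ID", "TYPE")

    resolve(df).orderBy("ID").show(false)
    spark.stop()
  }
}
```

With this sample data, the rows for id1 and id2 should resolve to type1 and typeX, while all rows for id3 fall back to "Unknown". This stays within built-in column functions, so no UDF is needed.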