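I have a DataFrame and am trying to apply a window function that partitions on an array column.
The logic is as follows: group by (or partition the window by) the id and filtered columns. Among the rows where the types column is null, compute the maximum score; for all other rows, use the row's own score. When a row's score is not equal to the group's maximum score, set the types column to "NA".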
val data = spark.createDataFrame(Seq(
(1, "shirt for women", Seq("shirt", "women"), 19.1, "ST"),
(1, "shirt for women", Seq("shirt", "women"), 10.1, null),
(1, "shirt for women", Seq("shirt", "women"), 12.1, null),
(0, "shirt group women", Seq("group", "women"), 15.1, null),
(0, "shirt group women", Seq("group", "women"), 12.1, null),
(3, "shirt nmn women", Seq("shirt", "women"), 16.1, "ST"),
(3, "shirt were women", Seq("shirt", "women"), 13.1, "ST")
)).toDF("id", "raw", "filtered", "score", "types")
+---+-----------------+--------------+-----+-----+
|id |raw |filtered |score|types|
+---+-----------------+--------------+-----+-----+
|1 |shirt for women |[shirt, women]|19.1 |ST |
|1 |shirt for women |[shirt, women]|10.1 |null |
|1 |shirt for women |[shirt, women]|12.1 |null |
|0 |shirt group women|[group, women]|15.1 |null |
|0 |shirt group women|[group, women]|12.1 |null |
|3 |shirt nmn women |[shirt, women]|16.1 |ST |
|3 |shirt were women |[shirt, women]|13.1 |ST |
+---+-----------------+--------------+-----+-----+
Expected output:
+---+-----------------+--------------+-----+-----+
|id |raw |filtered |score|types|
+---+-----------------+--------------+-----+-----+
|1 |shirt for women |[shirt, women]|19.1 |ST |
|1 |shirt for women |[shirt, women]|10.1 |NA |
|1 |shirt for women |[shirt, women]|12.1 |null |
|0 |shirt group women|[women, group]|15.1 |null |
|0 |shirt group women|[women, group]|12.1 |NA |
|3 |shirt nmn women |[shirt, women]|16.1 |ST |
|3 |shirt were women |[shirt, women]|13.1 |ST |
+---+-----------------+--------------+-----+-----+
I tried:
data.withColumn("max_score",
    when(col("types").isNull,
      max("score")
        .over(Window.partitionBy("id", "filtered")))
    .otherwise($"score"))
  .withColumn("type_temp",
    when(col("score") =!= col("max_score"),
      addReasonsUDF(col("type"),
        lit("NA")))
    .otherwise(col("type")))
  .drop("types", "max_score")
  .withColumnRenamed("type_temp", "types")
But it does not work. It gives me:
+---+-----------------+--------------+-----+---------+-----+
|id |raw |filtered |score|max_score|types|
+---+-----------------+--------------+-----+---------+-----+
|1 |shirt for women |[shirt, women]|19.1 |19.1 |ST |
|1 |shirt women |[shirt, women]|10.1 |19.1 |NA |
|1 |shirt of women |[shirt, women]|12.1 |19.1 |NA |
|0 |shirt group women|[group, women]|15.1 |15.1 |null |
|0 |shirt will women |[group, women]|12.1 |15.1 |NA |
|3 |shirt nmn women |[shirt, women]|16.1 |16.1 |ST |
|3 |shirt were women |[shirt, women]|13.1 |13.1 |ST |
+---+-----------------+--------------+-----+---------+-----+
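Can someone tell me what I am doing wrong?
Something is off with my window function: when I partition by id and raw instead, it does not work either. So neither partitioning by a string column nor partitioning by an array column behaves as I expect.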
dataSet.withColumn("max_score",
    when(col("types").isNull,
      max("score").over(Window.partitionBy("id", "raw")))
    .otherwise($"score")).show(false)
+---+-----------------+--------------+-----+-----+---------+
|id |raw |filtered |score|types|max_score|
+---+-----------------+--------------+-----+-----+---------+
|3 |shirt nmn women |[shirt, women]|16.1 |ST |16.1 |
|0 |shirt group women|[group, women]|15.1 |null |15.1 |
|0 |shirt group women|[group, women]|12.1 |null |15.1 |
|3 |shirt were women |[shirt, women]|13.1 |ST |13.1 |
|1 |shirt for women |[shirt, women]|19.1 |ST |19.1 |
|1 |shirt for women |[shirt, women]|10.1 |null |19.1 |
|1 |shirt for women |[shirt, women]|12.1 |null |19.1 |
+---+-----------------+--------------+-----+-----+---------+
Answer 0 (score: 2)
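You do not need to put the window function inside the when expression; it can be done in two stages instead. First, add the maximum score as a new column for each group defined by the id, filtered and types columns. Because types is part of the partition key, the rows where types is null form their own group, so this gives the maximum score computed over those rows only. A window expression is the better choice here (rather than groupBy) because the other columns should be kept.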
After that, a when / otherwise check can change the value of the types column to "NA" whenever types is null and the row's score is not equal to the maximum score.
In code:
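import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{max, when}
import spark.implicits._ // for the $"..." syntax; assumes a SparkSession named spark
// Including types in the key puts all null-typed rows of a group into one partition.
val w = Window.partitionBy("id", "filtered", "types")
val df = data.withColumn("max_score", max($"score").over(w))
  .withColumn("types", when($"types".isNull && $"score" =!= $"max_score", "NA").otherwise($"types"))
  .drop("max_score")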
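As a side note, the same result can probably be reached while keeping the question's original partitioning on id and filtered. The sketch below is an alternative technique, not the approach above: it moves the null check inside the aggregate, relying on when without otherwise returning null for non-matching rows and on max skipping nulls. The names NullOnlyMax, markNonMaxNulls, sampleData and null_max are made up for illustration, and the local SparkSession is only there so the snippet stands alone.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, when}

// Hypothetical helper object, for illustration only.
object NullOnlyMax {

  // Marks null-typed rows whose score is below the best null-typed score
  // in their (id, filtered) group with "NA"; all other rows are unchanged.
  def markNonMaxNulls(data: DataFrame): DataFrame = {
    val w = Window.partitionBy("id", "filtered")
    data
      // when(...) without otherwise is null for typed rows, and max ignores nulls,
      // so null_max is the maximum over the null-typed rows of the group only.
      .withColumn("null_max", max(when(col("types").isNull, col("score"))).over(w))
      .withColumn("types",
        when(col("types").isNull && col("score") =!= col("null_max"), "NA")
          .otherwise(col("types")))
      .drop("null_max")
  }

  // The sample rows from the question.
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(Int, String, Seq[String], Double, String)](
      (1, "shirt for women", Seq("shirt", "women"), 19.1, "ST"),
      (1, "shirt for women", Seq("shirt", "women"), 10.1, null),
      (1, "shirt for women", Seq("shirt", "women"), 12.1, null),
      (0, "shirt group women", Seq("group", "women"), 15.1, null),
      (0, "shirt group women", Seq("group", "women"), 12.1, null),
      (3, "shirt nmn women", Seq("shirt", "women"), 16.1, "ST"),
      (3, "shirt were women", Seq("shirt", "women"), 13.1, "ST")
    ).toDF("id", "raw", "filtered", "score", "types")
  }

  def main(args: Array[String]): Unit = {
    // Local session for a standalone run; in spark-shell, spark already exists.
    val spark = SparkSession.builder().master("local[1]").appName("null-only-max").getOrCreate()
    markNonMaxNulls(sampleData(spark)).show(false)
    spark.stop()
  }
}
```

On the question's data this should mark the 10.1 row of id 1 and the 12.1 row of id 0 as NA and leave every other row as it was. Compared with partitioning on types, this variant keeps the window definition identical to the one in the question, at the cost of a slightly less obvious aggregate expression.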