I'd like to aggregate a dataset (id, score, field1, field2, field3) by id, gathering the other columns, ordered by score, into some kind of list/column so they can be serialized into the objects below. collect_set only takes a single column, so I'm not sure how to get all the fields into one column without a concat. I also need to limit the list column to the top 3 tests. The resulting dataset would look like: Integer id, Array(List):
id, [[score, field1, field2, field3], [score, field1, field2, field3], [score, field1, field2, field3]]
class Student {
private int id;
private List<Test> tests;
}
class Test {
private int score;
private String field1;
private String field2;
private String field3;
}
For example:
id1,99,"just","some","text"
id1,95,"just","more","text"
id1,75,"still","more","text"
id1,88,"yet","more","text"
would result in:
id1,[[99,"just","some","text"], [95,"just","more","text"], [88,"yet","more","text"]]
This differs from previously asked questions because it involves sorting and limiting the output, which is why the answer needs a window function while the answers to the other questions do not.
Answer 0: (score: 0)
You can use a Window function together with struct:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.createDataFrame(
  Seq((1, 99, "a"), (1, 95, "b"), (1, 75, "c"), (1, 88, "d"))
).toDF("id", "score", "field")
df.show
+---+-----+-----+
| id|score|field|
+---+-----+-----+
| 1| 99| a|
| 1| 95| b|
| 1| 75| c|
| 1| 88| d|
+---+-----+-----+
val w = Window.partitionBy("id").orderBy($"score".desc)

val res = df.withColumn("row", row_number().over(w))
  .filter($"row" <= 3)
  .groupBy("id")
  .agg(collect_list(struct("score", "field")).as("data"))
res.show(false)
+---+------------------------+
|id |data |
+---+------------------------+
|1 |[[99,a], [95,b], [88,d]]|
+---+------------------------+
res.printSchema
root
|-- id: integer (nullable = false)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- score: integer (nullable = false)
| | |-- field: string (nullable = true)
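To get from this aggregated DataFrame to the Student/Test shape in the question, you can map the rows onto case classes with a Dataset encoder. This is a sketch using the simplified two-field (score, field) schema of this example; the case class names mirror the question's Java classes, and the column rename is an assumption to match the question's tests field:

```scala
// Case classes mirroring the question's Java POJOs, reduced to the
// simplified (score, field) schema used in this answer's example.
case class Test(score: Int, field: String)
case class Student(id: Int, tests: Seq[Test])

import spark.implicits._

// Encoders match struct fields to case-class fields by name, so after
// renaming "data" to "tests" the DataFrame converts directly.
val students = res.withColumnRenamed("data", "tests").as[Student]
students.collect().foreach(println)
```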
Note: confirm that collect_list maintains the order by score. If it does not (and you care about the ordering), you would need a udf to sort the data list for you.
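If you would rather not rely on collect_list preserving row order across the shuffle, one option (my addition, not part of the original answer) is to sort the collected array itself with the built-in sort_array. Since sort_array compares structs field by field and score is the first field of the struct, this sorts by score:

```scala
import org.apache.spark.sql.functions._

// sort_array compares structs field by field; with "score" first in
// the struct this is a sort by score, and asc = false makes it descending.
val resSorted = df.withColumn("row", row_number().over(w))
  .filter($"row" <= 3)
  .groupBy("id")
  .agg(sort_array(collect_list(struct("score", "field")), asc = false).as("data"))
```

This avoids a udf entirely and keeps the ordering guarantee explicit in the query plan.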