Spark: group rows and serialize into a list

Asked: 2018-04-24 20:46:07

Tags: apache-spark serialization

I want to aggregate a dataset of (id, score, field1, field2, field3) by id, collecting the remaining columns, sorted by score, into some kind of list column, so that each group can be deserialized into the objects below.

collect_set only takes a single column, so I'm not sure how to get all the fields into one column without concatenating them. I also need to limit the list column to the top 3 tests. The resulting dataset would look like: Integer id, Array(List)

id, [[score, field1, field2, field3], [score, field1, field2, field3], [score, field1, field2, field3]]

class Student {
    private int id;
    private List<Test> tests;
}

class Test {
    private int score;
    private String field1;
    private String field2;
    private String field3;
}
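To make the intended result concrete, here is a minimal plain-Scala sketch (my own illustration, outside Spark) of the grouping logic, using hypothetical case classes that mirror the POJOs above:

```scala
case class Test(score: Int, field1: String, field2: String, field3: String)
case class Student(id: Int, tests: List[Test])

// Flat input rows, as in the example below
val rows = List(
  (1, Test(99, "just", "some", "text")),
  (1, Test(95, "just", "more", "text")),
  (1, Test(75, "still", "more", "text")),
  (1, Test(88, "yet", "more", "text"))
)

// Group by id, sort each group's tests by score descending, keep the top 3
val students = rows
  .groupBy(_._1)
  .map { case (id, ts) => Student(id, ts.map(_._2).sortBy(-_.score).take(3)) }
  .toList
```

This is only the target semantics on in-memory collections; the question is how to express the same thing over a distributed DataFrame.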

For example:

id1,99,"just","some","text"
id1,95,"just","more","text"
id1,75,"still","more","text"
id1,88,"yet","more","text"

would result in:

id1,[[99,"just","some","text"], [95,"just","more","text"], [88,"yet","more","text"]]

This differs from previously asked questions because it involves sorting and limiting the output, which is why the answer requires a window function while the answers to those other questions do not.

1 Answer:

Answer 0 (score: 0)

You can use Window functions together with struct:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, row_number, struct}
import spark.implicits._

val df = spark.createDataFrame(
    Seq((1, 99, "a"), (1, 95, "b"), (1, 75, "c"), (1, 88, "d"))
  ).toDF("id", "score", "field")

df.show
+---+-----+-----+
| id|score|field|
+---+-----+-----+
|  1|   99|    a|
|  1|   95|    b|
|  1|   75|    c|
|  1|   88|    d|
+---+-----+-----+

val w = Window.partitionBy("id").orderBy($"score".desc)

val res = df.withColumn("row", row_number().over(w))
  .filter($"row" <= 3)
  .groupBy("id")
  .agg(collect_list(struct("score", "field")).as("data"))

res.show(false)
+---+------------------------+
|id |data                    |
+---+------------------------+
|1  |[[99,a], [95,b], [88,d]]|
+---+------------------------+

res.printSchema
root
 |-- id: integer (nullable = false)
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- score: integer (nullable = false)
 |    |    |-- field: string (nullable = true)

Note: confirm that collect_list preserves the ordering by score. If it does not (and you care about the ordering), you will need to create a UDF that sorts the data list on your behalf.
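One remedy for that ordering caveat (my suggestion, not part of the answer above) is to sort the array after collection: Spark's `sort_array(col, asc = false)` orders an array of structs by their fields in declaration order, so placing `score` first in the `struct` yields descending-by-score order. The comparison semantics can be illustrated in plain Scala, with tuples standing in for the collected structs:

```scala
// Tuples stand in for the (score, field) structs collected per id.
// collect_list gives no ordering guarantee, so assume arbitrary order:
val collected = Vector((95, "b"), (88, "d"), (99, "a"))

// Like structs, tuples compare field-by-field, so sorting on the tuple
// itself orders primarily by score; reverse the ordering for descending.
val ordered = collected.sorted(Ordering[(Int, String)].reverse)
```

Note that with a string as the second field, ties on score fall back to a (reversed) lexicographic comparison of that field.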
