Scala Spark groupBy: preserving date order in lists collected by aggregation

Asked: 2018-08-13 21:04:57

Tags: scala apache-spark dataframe group-by aggregate

val df = Seq(
  (1221, 1, "Boston", "9/22/18 14:00"), (1331, 1, "New York", "8/10/18 14:00"),
  (1442, 1, "Toronto", "10/15/19 14:00"), (2041, 2, "LA", "1/2/18 14:00"),
  (2001, 2, "San Fransisco", "5/20/18 15:00"), (3001, 3, "San Jose", "6/02/18 14:00"),
  (3121, 3, "Seattle", "9/12/18 16:00"), (34562, 3, "Utah", "12/12/18 14:00"),
  (3233, 3, "Boston", "8/31/18 14:00"), (4120, 4, "Miami", "1/01/18 14:00"),
  (4102, 4, "Cincinati", "7/21/19 14:00"), (4201, 4, "Washington", "5/10/18 23:00"),
  (4301, 4, "New Jersey", "3/27/18 15:00"), (4401, 4, "Raleigh", "11/14/18 14:00")
).toDF("id", "group_id", "place", "date")

This is a simple df:

+-----+--------+-------------+--------------+
|   id|group_id|        place|          date|
+-----+--------+-------------+--------------+
| 1221|       1|       Boston| 9/22/18 14:00|
| 1331|       1|     New York| 8/10/18 14:00|
| 1442|       1|      Toronto|10/15/19 14:00|
| 2041|       2|           LA|  1/2/18 14:00|
| 2001|       2|San Fransisco| 5/20/18 15:00|
| 3001|       3|     San Jose| 6/02/18 14:00|
| 3121|       3|      Seattle| 9/12/18 16:00|
|34562|       3|         Utah|12/12/18 14:00|
| 3233|       3|       Boston| 8/31/18 14:00|
| 4120|       4|        Miami| 1/01/18 14:00|
| 4102|       4|    Cincinati| 7/21/19 14:00|
| 4201|       4|   Washington| 5/10/18 23:00|
| 4301|       4|   New Jersey| 3/27/18 15:00|
| 4401|       4|      Raleigh|11/14/18 14:00|
+-----+--------+-------------+--------------+

I want to group by "group_id" and collect the dates in ascending order (earliest date first).

Desired output:

+--------+----+--------+--------------+----+-------------+--------------+----+----------+--------------+-----+---------+--------------+
|group_id|id_1| venue_1|        date_1|id_2|      venue_2|        date_2|id_3|   venue_3|        date_3| id_4|  venue_4|        date_4|
+--------+----+--------+--------------+----+-------------+--------------+----+----------+--------------+-----+---------+--------------+
|       1|1331|New York|08/10/18 14:00|1221|       Boston|09/22/18 14:00|1442|   Toronto|10/15/19 14:00| null|     null|          null|
|       3|3001|San Jose|06/02/18 14:00|3233|       Boston|08/31/18 14:00|3121|   Seattle|09/12/18 16:00|34562|     Utah|12/12/18 14:00|
|       4|4120|   Miami|01/01/18 14:00|4301|   New Jersey|03/27/18 15:00|4201|Washington|05/10/18 23:00| 4102|Cincinati|07/21/19 14:00|
|       2|2041|      LA| 01/2/18 14:00|2001|San Fransisco|05/20/18 15:00|null|      null|          null| null|     null|          null|
+--------+----+--------+--------------+----+-------------+--------------+----+----------+--------------+-----+---------+--------------+

The code I am using:

// for sorting by date to preserve order
val df2 = df.repartition(col("group_id")).sortWithinPartitions("date")

val finalDF = df2.groupBy(df("group_id"))
  .agg(
    collect_list(df("id")).alias("id_list"),
    collect_list(df("place")).alias("venue_name_list"),
    collect_list(df("date")).alias("date_list"))
  .selectExpr("group_id",
    "id_list[0] as id_1", "venue_name_list[0] as venue_1", "date_list[0] as date_1",
    "id_list[1] as id_2", "venue_name_list[1] as venue_2", "date_list[1] as date_2",
    "id_list[2] as id_3", "venue_name_list[2] as venue_3", "date_list[2] as date_3",
    "id_list[3] as id_4", "venue_name_list[3] as venue_4", "date_list[3] as date_4")

But the output is:

+--------+-----+-------+--------------+----+-------------+--------------+----+----------+-------------+----+----------+-------------+
|group_id| id_1|venue_1|        date_1|id_2|      venue_2|        date_2|id_3|   venue_3|       date_3|id_4|   venue_4|       date_4|
+--------+-----+-------+--------------+----+-------------+--------------+----+----------+-------------+----+----------+-------------+
|       1| 1442|Toronto|10/15/19 14:00|1331|     New York| 8/10/18 14:00|1221|    Boston|9/22/18 14:00|null|      null|         null|
|       3|34562|   Utah|12/12/18 14:00|3001|     San Jose| 6/02/18 14:00|3233|    Boston|8/31/18 14:00|3121|   Seattle|9/12/18 16:00|
|       4| 4120|  Miami| 1/01/18 14:00|4401|      Raleigh|11/14/18 14:00|4301|New Jersey|3/27/18 15:00|4201|Washington|5/10/18 23:00|
|       2| 2041|     LA|  1/2/18 14:00|2001|San Fransisco| 5/20/18 15:00|null|      null|         null|null|      null|         null|
+--------+-----+-------+--------------+----+-------------+--------------+----+----------+-------------+----+----------+-------------+

Observation: if the dates are formatted with a leading "0" before single-digit months and days, e.g. "09/22/18 14:00" instead of "9/22/18 14:00" in the example, the code works correctly, i.e. the date order is maintained properly. Any solution is welcome! Thanks.
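Presumably this happens because "date" is a plain string column, so the sort compares values lexicographically rather than chronologically. A minimal plain-Scala sketch using three of the dates above shows the effect:

// String (lexicographic) order is not chronological order:
// "10/15/19 ..." starts with '1', which sorts before '8' and '9'.
val dates = Seq("9/22/18 14:00", "8/10/18 14:00", "10/15/19 14:00")
println(dates.sorted)
// List(10/15/19 14:00, 8/10/18 14:00, 9/22/18 14:00)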

2 Answers:

Answer 0 (score: 0):

Since you already know that sorting by the unformatted StringType dates is the root of the problem, here is one approach: first generate a TimestampType date, then assemble the wanted columns into a StructType column whose fields are ordered so that the sort works as intended:

import org.apache.spark.sql.functions.{collect_list, sort_array, struct, to_timestamp}

val finalDF = df.
  withColumn("dateFormatted", to_timestamp($"date", "MM/dd/yy HH:mm")).
  groupBy($"group_id").agg(
    sort_array(collect_list(struct($"dateFormatted", $"id", $"place"))).as("sorted_arr")
  ).
  selectExpr(
    "group_id",
    "sorted_arr[0].id as id_1", "sorted_arr[0].place as venue_1", "sorted_arr[0].dateFormatted as date_1",
    "sorted_arr[1].id as id_2", "sorted_arr[1].place as venue_2", "sorted_arr[1].dateFormatted as date_2",
    "sorted_arr[2].id as id_3", "sorted_arr[2].place as venue_3", "sorted_arr[2].dateFormatted as date_3",
    "sorted_arr[3].id as id_4", "sorted_arr[3].place as venue_4", "sorted_arr[3].dateFormatted as date_4"
  )

finalDF.show
// +--------+----+--------+-------------------+----+-------------+-------------------+----+----------+-------------------+-----+-------+-------------------+
// |group_id|id_1| venue_1|             date_1|id_2|      venue_2|             date_2|id_3|   venue_3|             date_3| id_4|venue_4|             date_4|
// +--------+----+--------+-------------------+----+-------------+-------------------+----+----------+-------------------+-----+-------+-------------------+
// |       1|1331|New York|2018-08-10 14:00:00|1221|       Boston|2018-09-22 14:00:00|1442|   Toronto|2019-10-15 14:00:00| null|   null|               null|
// |       3|3001|San Jose|2018-06-02 14:00:00|3233|       Boston|2018-08-31 14:00:00|3121|   Seattle|2018-09-12 16:00:00|34562|   Utah|2018-12-12 14:00:00|
// |       4|4120|   Miami|2018-01-01 14:00:00|4301|   New Jersey|2018-03-27 15:00:00|4201|Washington|2018-05-10 23:00:00| 4401|Raleigh|2018-11-14 14:00:00|
// |       2|2041|      LA|2018-01-02 14:00:00|2001|San Fransisco|2018-05-20 15:00:00|null|      null|               null| null|   null|               null|
// +--------+----+--------+-------------------+----+-------------+-------------------+----+----------+-------------------+-----+-------+-------------------+

A few notes:

  1. A StructType column must be formed so that the corresponding fields are sorted together.
  2. The struct field dateFormatted is placed first so that sort_array sorts the array in the desired order (see the small sketch after these notes).
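To illustrate that ordering rule, here is a minimal self-contained sketch, assuming a spark session is in scope (the field names d and p are made up for the example). Spark compares structs field by field, so the first field decides the order:

import org.apache.spark.sql.functions.{array, lit, sort_array, struct}

// Two hard-coded (date, place) structs in deliberately wrong order;
// sort_array orders them by the first struct field, d.
val demo = spark.range(1).select(
  sort_array(array(
    struct(lit("2018-09-22").as("d"), lit("Boston").as("p")),
    struct(lit("2018-08-10").as("d"), lit("New York").as("p"))
  )).as("sorted"))

demo.show(false)
// [[2018-08-10, New York], [2018-09-22, Boston]]  <- earliest date first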

Answer 1 (score: 0):

Format the date with the to_timestamp function and sort during the aggregation with sort_array, collecting id, place and date together as one struct so that all three stay aligned, as follows:

import org.apache.spark.sql.functions.{collect_list, sort_array, struct, to_timestamp}

val df = Seq(
  (1221, 1, "Boston", "9/22/18 14:00"), (1331, 1, "New York", "8/10/18 14:00"),
  (1442, 1, "Toronto", "10/15/19 14:00"), (2041, 2, "LA", "1/2/18 14:00"),
  (2001, 2, "San Fransisco", "5/20/18 15:00"), (3001, 3, "San Jose", "6/02/18 14:00"),
  (3121, 3, "Seattle", "9/12/18 16:00"), (34562, 3, "Utah", "12/12/18 14:00"),
  (3233, 3, "Boston", "8/31/18 14:00"), (4120, 4, "Miami", "1/01/18 14:00"),
  (4102, 4, "Cincinati", "7/21/19 14:00"), (4201, 4, "Washington", "5/10/18 23:00"),
  (4301, 4, "New Jersey", "3/27/18 15:00"), (4401, 4, "Raleigh", "11/14/18 14:00")
).toDF("id", "group_id", "place", "date")

// Parse the two-digit-year strings into a real timestamp column
// (the year pattern is "yy", since the inputs look like "18" and "19").
val df2 = df.withColumn("MyDate", to_timestamp($"date", "MM/dd/yy HH:mm"))

// Collect (MyDate, id, place) as one struct per row and sort the resulting
// array; sorting the collected date list alone would reorder the dates but
// leave the ids and venues detached from them.
val finalDF = df2.groupBy(df2("group_id"))
  .agg(sort_array(collect_list(struct(df2("MyDate"), df2("id"), df2("place"))))
    .alias("sorted_list"))
  .selectExpr("group_id",
    "sorted_list[0].id as id_1",
    "sorted_list[0].place as venue_1",
    "sorted_list[0].MyDate as date_1",
    "sorted_list[1].id as id_2",
    "sorted_list[1].place as venue_2",
    "sorted_list[1].MyDate as date_2",
    "sorted_list[2].id as id_3",
    "sorted_list[2].place as venue_3",
    "sorted_list[2].MyDate as date_3",
    "sorted_list[3].id as id_4",
    "sorted_list[3].place as venue_4",
    "sorted_list[3].MyDate as date_4")

finalDF.show()
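Collecting the three fields in a single struct, keyed on MyDate, is what keeps each id and venue aligned with its date after sorting; the resulting show output should match the first answer's, with the date columns rendered as timestamps (e.g. 2018-08-10 14:00:00).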