How to create a single array-of-struct column from "multiple" individual DF/DS columns using Spark Scala

Date: 2018-10-31 05:34:43

Tags: apache-spark apache-spark-sql dataset

Say I have two tables, order_table and room_table.

order_table

+----------+---------+
| order_id | info    |
+----------+---------+
| order1   | infos   |
+----------+---------+

room_table has many columns

+----------+---------+-----+
| order_id | room_id | ... | 
+----------+---------+-----+
| order1   | room1   | ... |
| order1   | room2   | ... |
+----------+---------+-----+

I want to take the rows of room_table grouped by order_id (something like select * from room_table group by order_id), collect each group into a list, and add it to order_table as a new column rooms.

The output table should keep the following schema:

- order_id: string
- info: string
- room: array&lt;struct&gt;
  -- room_id: string
  -- room_price: int
  -- room_name: string
  -- ....

1 answer:

Answer 0 (score: 2):

    import org.apache.spark.sql.functions.{collect_list, struct}
    import spark.implicits._  // assumes a SparkSession in scope as `spark`, as in spark-shell

    // Sample data standing in for order_table and room_table
    val df1 = Seq(("order_1", "order_1_info"),
              ("order_2", "order_2_info")).toDF("order_id", "info")
    val df2 = Seq(("order_1", "room_1", 100, "palace_1"),
              ("order_2", "room_2", 200, "palace_2"),
              ("order_1", "room_3", 100, "palace_3"),
              ("order_2", "room_8", 200, "palace_x"))
              .toDF("order_id", "room_id", "room_price", "room_name")

    // Pack all of df2's columns into a struct per row, then collect the
    // structs for each order_id into an array column named "room"
    val cols: Array[String] = df2.columns
    val df3 = df2.groupBy("order_id").agg(collect_list(struct(cols.head, cols.tail: _*)) as "room")

    // Join the aggregated rooms back onto the orders
    val df4 = df1.join(df3, Seq("order_id"))
    df4.show()
    df4.printSchema()

In the snippet above, I just made some sample dataframes to work with.
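One difference from the schema sketched in the question: cols here is all of df2's columns, so order_id is duplicated inside every struct element. A minimal variant that leaves it out (the name df3NoKey is just illustrative; everything else stays the same):

    // Drop order_id from the struct so the array elements match the
    // desired schema (room_id, room_price, room_name, ...)
    val roomCols = df2.columns.filter(_ != "order_id")
    val df3NoKey = df2.groupBy("order_id")
      .agg(collect_list(struct(roomCols.head, roomCols.tail: _*)) as "room")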

Output:

+--------+------------+--------------------+
|order_id|        info|                room|
+--------+------------+--------------------+
| order_1|order_1_info|[[order_1,room_1,...|
| order_2|order_2_info|[[order_2,room_2,...|
+--------+------------+--------------------+
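The room column is truncated here by show's default 20-character limit; calling df4.show(false) disables truncation and prints the full nested structs.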

Schema:

root
 |-- order_id: string (nullable = true)
 |-- info: string (nullable = true)
 |-- room: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- order_id: string (nullable = true)
 |    |    |-- room_id: string (nullable = true)
 |    |    |-- room_price: integer (nullable = false)
 |    |    |-- room_name: string (nullable = true)
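
One caveat: collect_list makes no guarantee about the order of the collected structs, since row order is not preserved across the shuffle that groupBy introduces. A common workaround, sketched below with the same dataframes, is to sort before aggregating; in practice this usually yields ordered lists, though Spark does not formally guarantee it:

    // Sort first so the collected structs tend to come out in room_id order;
    // note this is a practical pattern, not a guarantee from Spark.
    val df3Ordered = df2.sort("order_id", "room_id")
      .groupBy("order_id")
      .agg(collect_list(struct(cols.head, cols.tail: _*)) as "room")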

I hope this helps.