Question

I have a DataFrame of flattened order lines with the following columns:

orderId (String), orderLine (struct)

1,  {"sequence":1,"productId":11111111,"productName":"Blah","quantity":1,"unitPrice":{"net":65},"totalPrice":{"gross":67.84,"net":65,"tax":2.84}}

1,  {"sequence":2,"productId":22222222,"productName":"Blah2","quantity":1,"unitPrice":{"net":100},"totalPrice":{"gross":104.38,"net":100,"tax":4.38}}

What is the most efficient way to generate a DataFrame from this with the following columns:

orderId (string), orderLines (Array of orderLine Struct)

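Basically, I want to group/collect the individual line structs for a given order into a single array of line items. In this example, orderLines would contain the 2 orderLine items as elements of the array.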

Answer 1

I'd use groupBy and the collect_list function as follows:

import org.apache.spark.sql.functions.collect_list
orders.groupBy("orderId").agg(collect_list("orderLine") as "orderLines")

See Dataset (for groupBy) and the functions object (for the collect_list function).
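
For a fuller picture, here is a minimal, self-contained sketch of the same approach in Scala. It assumes a local SparkSession and uses hypothetical case classes (OrderLine, FlatOrder) that model only a subset of the struct fields from the question. The helper name collectOrderLines and the "orderLines" alias are also my own choices for illustration, not something the original answer specifies.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.collect_list

// Hypothetical case classes: a simplified subset of the orderLine struct in the question.
case class OrderLine(sequence: Int, productId: Long, productName: String, quantity: Int)
case class FlatOrder(orderId: String, orderLine: OrderLine)

object CollectOrderLines {
  // Groups the flattened rows by orderId and gathers each order's
  // line structs into a single array column named "orderLines".
  def collectOrderLines(orders: DataFrame): DataFrame =
    orders
      .groupBy("orderId")
      .agg(collect_list("orderLine").as("orderLines"))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("collect-order-lines")
      .getOrCreate()
    import spark.implicits._

    // Two flattened lines belonging to the same order, as in the question.
    val orders = Seq(
      FlatOrder("1", OrderLine(1, 11111111L, "Blah", 1)),
      FlatOrder("1", OrderLine(2, 22222222L, "Blah2", 1))
    ).toDF()

    val grouped = collectOrderLines(orders)
    grouped.printSchema()          // orderId: string, orderLines: array<struct<...>>
    grouped.show(truncate = false) // one row per orderId

    spark.stop()
  }
}
```

Note that collect_list does not guarantee the order of the collected elements after a shuffle, so if the line order matters you may need to sort the array (for example by sequence) afterwards.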
