I have a question about aggregating over nested JSON arrays. I have a sample orders DataFrame (shown here as JSON) like this:
{
"orderId": "oi1",
"orderLines": [
{
"productId": "p1",
"quantity": 1,
"sequence": 1,
"totalPrice": {
"gross": 50,
"net": 40,
"tax": 10
}
},
{
"productId": "p2",
"quantity": 3,
"sequence": 2,
"totalPrice": {
"gross": 300,
"net": 240,
"tax": 60
}
}
]
}
How can I sum the quantity across all the lines of a given order using Spark SQL?
For example, in this case: 1 + 3 = 4.
I would like to write something like the following, but there does not seem to be a built-in function that supports it (unless I have missed it, which is entirely possible!):
SELECT
orderId,
sum_array(orderLines.quantity) as totalQuantityItems
FROM
orders
Would a custom UDF (Scala) be needed? If so, what would it look like? Any examples? The same goes for reaching even deeper into the nesting, e.g. summing the net totals of the order lines:
SELECT
orderId,
sum_array(orderLines.totalPrice.net) as totalOrderNet
FROM
orders
Answer 0: (score: 2)
Read the dataset using spark.read.json.
val orders = spark.
  read.
  option("multiLine", true). // the option for multi-line JSON is "multiLine" as of Spark 2.2 (pre-release builds called it "wholeFile")
  json("orders.json").
  as[(String, Seq[(String, Long, Long, (Long, Long, Long))])]
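The tuple positions in the as[...] cast follow the inferred schema; as far as I know, Spark's JSON schema inference sorts struct fields alphabetically, which here gives (productId, quantity, sequence, totalPrice(gross, net, tax)). A quick way to double-check before writing the cast:
orders.printSchema
// should show orderId: string and orderLines: array<struct<productId, quantity,
// sequence, totalPrice<gross, net, tax>>>, with the numeric fields inferred as long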
scala> orders.show(truncate = false)
+-------+--------------------------------------------+
|orderId|orderLines |
+-------+--------------------------------------------+
|oi1 |[[p1,1,1,[50,40,10]], [p2,3,2,[300,240,60]]]|
+-------+--------------------------------------------+
scala> orders.map { case (id, lines) => (id, lines.map(_._2).sum) }.toDF("id", "sum").show
+---+---+
| id|sum|
+---+---+
|oi1| 4|
+---+---+
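The same pattern reaches the deeper nesting from the question. A sketch for totalOrderNet under the tuple encoding above (totalPrice is the line tuple's 4th element and net is the 2nd field inside it):
// _4 is totalPrice, _4._2 is net; for the sample order: 40 + 240 = 280
val netPerOrder = orders.
  map { case (id, lines) => (id, lines.map(_._4._2).sum) }.
  toDF("id", "totalOrderNet")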
You can also make the quantity computation "prettier" using a Scala for-comprehension.
val quantities = for {
  o <- orders        // o: (orderId, orderLines)
  id = o._1
  quantity <- o._2   // one element per order line
} yield (id, quantity._2)
val sumPerOrder = quantities.
  toDF("id", "quantity"). // <-- back to DataFrames to have column names
  groupBy("id").
  agg(sum("quantity") as "sum") // sum is org.apache.spark.sql.functions.sum (auto-imported in spark-shell)
scala> sumPerOrder.show
+---+---+
| id|sum|
+---+---+
|oi1| 4|
+---+---+
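As for the custom UDF (Scala) the question asks about: one is not required, but here is what it could look like. A sketch, assuming that an array-of-structs column reaches a Scala UDF as Seq[Row]; the name sum_array_quantity is my own:
import org.apache.spark.sql.Row
// Register a UDF that sums the quantity field over an array of order-line structs.
spark.udf.register("sum_array_quantity",
  (lines: Seq[Row]) => lines.map(_.getAs[Long]("quantity")).sum)
// Expose the Dataset to SQL so the query is close to the question's wished-for syntax.
orders.createOrReplaceTempView("orders")
spark.sql("SELECT orderId, sum_array_quantity(orderLines) AS totalQuantityItems FROM orders").show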