I have a table named "Sold_Items", shown below, and I want to use Spark SQL to get the net sold quantity for each participant.
Item Buyer Seller Qty
----------------------------------
A JD Lidl 100
B SD JD 500
A Coop JD 125
C JD SD 300
Intermediate table:
Item Participant Buy Sell
--------------------------------------------
A JD 100 125
B JD 0 500
C JD 300 0
A Coop 125 0
A Lidl 0 100
B SD 500 0
C SD 0 300
The final result should look like this:
Item Participant Net Sell
----------------------------------
A JD 25
B JD 500
C JD -300
A Coop -125
A Lidl 100
B SD -500
C SD 300
The two queries below give me the buy side and the sell side of the first table.
Buy:
SELECT Item, Buyer, sum(qty) as buy_qty from sold_items group by Item, Buyer
Sell:
SELECT Item, Seller, sum(qty) as sell_qty from sold_items group by Item, Seller
I am trying to build the intermediate table so I can use it to get the final result, but I can't seem to join these two queries. How can I combine the two queries above to get the intermediate table?
Answer (score: 1)

Unpivot and re-aggregate, using union all.
The simplest version:
select item, user, sum(buy_qty) as buy_qty, sum(sell_qty) as sell_qty
from ((select item, buyer as user, sum(qty) as buy_qty, 0 as sell_qty
       from sold_items
       group by item, buyer
      ) union all
      (select item, seller as user, 0, sum(qty)
       from sold_items
       group by item, seller
      )
     ) bs
group by item, user;
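Since the query is plain SQL, the unpivot-and-reaggregate idea can be sanity-checked outside Spark. The sketch below is an assumption-laden stand-in: it uses Python's built-in sqlite3 instead of Spark SQL, loads the sample Sold_Items rows from the question, and aliases the unpivoted column as `participant` (rather than `user`) to match the question's column name. The result should reproduce the intermediate table.

```python
import sqlite3

# Stand-in for Spark SQL (assumption: sqlite3 is used here only because the
# query is plain SQL; it is not part of the original answer).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sold_items (item TEXT, buyer TEXT, seller TEXT, qty INTEGER)"
)
conn.executemany(
    "INSERT INTO sold_items VALUES (?, ?, ?, ?)",
    [("A", "JD", "Lidl", 100),
     ("B", "SD", "JD", 500),
     ("A", "Coop", "JD", 125),
     ("C", "JD", "SD", 300)],
)

# Unpivot buyer/seller into a single participant column, then re-aggregate
# per (item, participant) to get the intermediate buy/sell table.
intermediate = conn.execute("""
    SELECT item, participant, SUM(buy_qty) AS buy, SUM(sell_qty) AS sell
    FROM (SELECT item, buyer AS participant, qty AS buy_qty, 0 AS sell_qty
          FROM sold_items
          UNION ALL
          SELECT item, seller, 0, qty
          FROM sold_items) bs
    GROUP BY item, participant
    ORDER BY item, participant
""").fetchall()

for row in intermediate:
    print(row)
```

Each row comes out as (item, participant, buy, sell), e.g. ('A', 'JD', 100, 125), matching the intermediate table above.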
Note that the aggregation in the subqueries is not actually needed, so this also works:
select item, user, sum(buy_qty) as buy_qty, sum(sell_qty) as sell_qty
from ((select item, buyer as user, qty as buy_qty, 0 as sell_qty
       from sold_items
      ) union all
      (select item, seller as user, 0, qty
       from sold_items
      )
     ) bs
group by item, user;
I would expect the version with the pre-aggregation to perform better on a large data set, although the improvement may not be that large.
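From the intermediate table, the final result follows by taking sell minus buy per (item, participant), which can be folded into the same query. A minimal sketch, again using sqlite3 as a stand-in engine (an assumption, not part of the original answer):

```python
import sqlite3

# Same sample data as the question's Sold_Items table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sold_items (item TEXT, buyer TEXT, seller TEXT, qty INTEGER)"
)
conn.executemany(
    "INSERT INTO sold_items VALUES (?, ?, ?, ?)",
    [("A", "JD", "Lidl", 100),
     ("B", "SD", "JD", 500),
     ("A", "Coop", "JD", 125),
     ("C", "JD", "SD", 300)],
)

# Net sell = total sold minus total bought per (item, participant),
# computed directly on the unpivoted rows in one pass.
net = conn.execute("""
    SELECT item, participant, SUM(sell_qty) - SUM(buy_qty) AS net_sell
    FROM (SELECT item, buyer AS participant, qty AS buy_qty, 0 AS sell_qty
          FROM sold_items
          UNION ALL
          SELECT item, seller, 0, qty
          FROM sold_items) bs
    GROUP BY item, participant
    ORDER BY item, participant
""").fetchall()

print(net)
```

This yields rows such as ('A', 'JD', 25) and ('C', 'JD', -300), matching the final result table in the question.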