Spark SQL: joining two results from the same table

Time: 2020-05-21 11:37:35

Tags: sql apache-spark-sql

I have a table named "Sold_Items", shown below, and I want to use Spark SQL to get the net sell quantity for each participant and item.

Item     Buyer   Seller   Qty
----------------------------------
A        JD      Lidl     100
B        SD      JD       500
A        Coop    JD       125
C        JD      SD       300

Intermediate table:

Item     Participant      Buy    Sell
--------------------------------------
A        JD               100     125
B        JD                 0     500
C        JD               300       0
A        Coop             125       0
A        Lidl               0     100
B        SD               500       0
C        SD                 0     300

The final result should look like this:

Item     Participant      Net Sell
----------------------------------
A        JD                 25
B        JD                500
C        JD               -300
A        Coop             -125
A        Lidl              100
B        SD               -500  
C        SD                300

The two queries below compute the buy side and the sell side from the first table.

Buy:

SELECT Item, Buyer, sum(qty) as buy_qty from sold_items group by Item, Buyer

Sell:

SELECT Item, Seller, sum(qty) as sell_qty from sold_items group by Item, Seller

I am trying to build the intermediate table so that I can use it to get the final result, but I can't seem to join these two queries. How can I combine the two queries above to get the intermediate table?

1 Answer:

Answer 0 (score: 1):

Unpivot and reaggregate. This is simplest with union all:

select item, participant, sum(buy_qty) as buy, sum(sell_qty) as sell
from ((select item, buyer as participant, sum(qty) as buy_qty, 0 as sell_qty
       from sold_items
       group by item, buyer
      ) union all
      (select item, seller, 0, sum(qty)
       from sold_items
       group by item, seller
      )
     ) bs
group by item, participant;
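
From there, Net Sell is just the sell total minus the buy total, so the final result can be computed in one step rather than materializing the intermediate table; a minimal sketch along the same lines (net_sell is an illustrative alias, not from the original):

select item, participant, sum(sell_qty) - sum(buy_qty) as net_sell
from ((select item, buyer as participant, sum(qty) as buy_qty, 0 as sell_qty
       from sold_items
       group by item, buyer
      ) union all
      (select item, seller, 0, sum(qty)
       from sold_items
       group by item, seller
      )
     ) bs
group by item, participant;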

Note that the aggregation in the subqueries isn't actually needed, so this also works:

select item, participant, sum(buy_qty) as buy, sum(sell_qty) as sell
from ((select item, buyer as participant, qty as buy_qty, 0 as sell_qty
       from sold_items
      ) union all
      (select item, seller, 0, qty
       from sold_items
      )
     ) bs
group by item, participant;

I would expect the version with the aggregations in the subqueries to perform better on a large data set, since the pre-aggregation shrinks the rows flowing into the union and the final group by, although the improvement may not be that big.
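
For completeness, the join the question was attempting also works: full outer join the buy and sell queries on item and participant, using coalesce for participants that appear on only one side. A sketch, with net_sell again an illustrative alias:

select coalesce(b.item, s.item) as item,
       coalesce(b.buyer, s.seller) as participant,
       coalesce(s.sell_qty, 0) - coalesce(b.buy_qty, 0) as net_sell
from (select item, buyer, sum(qty) as buy_qty
      from sold_items
      group by item, buyer
     ) b full outer join
     (select item, seller, sum(qty) as sell_qty
      from sold_items
      group by item, seller
     ) s
     on b.item = s.item and b.buyer = s.seller;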