SQL self-join and aggregation

Time: 2014-02-03 14:34:49

Tags: sql postgresql

I have a table in Postgres with the following structure:

Table path: passengers, origin, dest, date, month, year

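I want to find the top 3 routes by the number of passengers who traveled on each route within a year. Total passengers on a route (A <-> B) = total passengers (A -> B) + total passengers (B -> A).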

What is the best way to aggregate the number of passengers per route, given that the table has about 150 million rows?

Thanks

3 answers:

Answer 0 (score: 4)

There are two ways to do this. One uses aggregation only; the other uses a join.

select least(origin, dest) as od1, greatest(origin, dest) as od2, sum(passengers) as numpassengers
from path t
group by least(origin, dest), greatest(origin, dest)
order by numpassengers desc
limit 3;
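The aggregation approach can be tried on a toy table before running it against 150 million rows. The following is a minimal sketch, assuming an in-memory SQLite database as a stand-in for PostgreSQL: SQLite's two-argument `min()`/`max()` play the role of `least()`/`greatest()`, and the airport codes and passenger counts are made-up sample data.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("create table path (passengers integer, origin text, dest text)")
conn.executemany(
    "insert into path (passengers, origin, dest) values (?, ?, ?)",
    [
        (100, "JFK", "LAX"),
        (80, "LAX", "JFK"),
        (50, "JFK", "ORD"),
        (70, "ORD", "JFK"),
        (30, "ORD", "LAX"),
        (10, "SFO", "SEA"),
    ],
)

# Two-argument min()/max() in SQLite are scalar functions; they normalize
# (origin, dest) so that A -> B and B -> A fall into the same group.
top_routes = conn.execute(
    """
    select min(origin, dest) as od1, max(origin, dest) as od2,
           sum(passengers) as numpassengers
    from path
    group by min(origin, dest), max(origin, dest)
    order by numpassengers desc
    limit 3
    """
).fetchall()

for od1, od2, total in top_routes:
    print(od1, od2, total)
```

Both directions of JFK/LAX collapse into one row with 180 passengers, which is the behaviour the question asks for.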

The other is a self-join. If there is only one row per direction, you can do this without aggregation:

select p1.origin, p1.dest, p1.passengers + p2.passengers as numpassengers
from path p1 join
     path p2
     on p1.origin = p2.dest and p1.dest = p2.origin
where p1.origin < p1.dest
order by numpassengers desc
limit 3;

Otherwise, you need both the self-join and aggregation, so the first method is probably faster:

select p1.origin, p1.dest, sum(p1.passengers + p2.passengers) as numpassengers
from path p1 join
     path p2
     on p1.origin = p2.dest and p1.dest = p2.origin
where p1.origin < p1.dest
group by p1.origin, p1.dest
order by numpassengers desc
limit 3;
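One thing worth checking before relying on the self-join with aggregation: when a direction has several rows, the join pairs every A -> B row with every B -> A row, so a row can be summed more than once. The sketch below illustrates this on three made-up rows, again assuming SQLite as a stand-in for PostgreSQL. The `directed` subquery name is only illustrative; it sums each direction first and then joins the two totals.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table path (passengers integer, origin text, dest text)")
# Two rows for A -> B and one row for B -> A; the true route total is 35.
conn.executemany(
    "insert into path (passengers, origin, dest) values (?, ?, ?)",
    [(10, "A", "B"), (20, "A", "B"), (5, "B", "A")],
)

# Row-to-row self-join: the single B -> A row is paired with both A -> B rows.
naive = conn.execute(
    """
    select p1.origin, p1.dest, sum(p1.passengers + p2.passengers)
    from path p1 join path p2
         on p1.origin = p2.dest and p1.dest = p2.origin
    where p1.origin < p1.dest
    group by p1.origin, p1.dest
    """
).fetchall()

# Aggregate each direction first, then join one total row per direction.
preaggregated = conn.execute(
    """
    with directed as (
        select origin, dest, sum(passengers) as passengers
        from path
        group by origin, dest
    )
    select d1.origin, d1.dest, d1.passengers + d2.passengers
    from directed d1 join directed d2
         on d1.origin = d2.dest and d1.dest = d2.origin
    where d1.origin < d1.dest
    """
).fetchall()

print(naive)          # the B -> A passengers are counted twice
print(preaggregated)  # each row is counted once
```

Note that either join form drops routes that have traffic in only one direction, which the pure aggregation query does not.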

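I don't know which is more efficient. However, I suspect that the rows making up the top three routes will be among, say, the top 100 passenger values in each direction. If so, build an index on passengers and try: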

select least(origin, dest) as od1, greatest(origin, dest) as od2, sum(passengers) as numpassengers
from path t cross join
     (select min(passengers) as cutoff
      from (select distinct passengers
            from path
            order by passengers desc
            limit 100
           ) t
     ) minp
where t.passengers >= minp.cutoff
group by least(origin, dest), greatest(origin, dest)
order by numpassengers desc
limit 3;

The cutoff calculation should be able to use only the index, and it should greatly reduce the load on the rest of the query.
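To see what the cutoff filter does to the totals, here is a small sketch under the same assumptions as before (SQLite standing in for PostgreSQL, made-up data). The index name `path_passengers_idx` and the helper `top_routes_with_cutoff` are hypothetical names chosen for illustration, and the cutoff uses the top 3 distinct values instead of 100 so the effect is visible on six rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table path (passengers integer, origin text, dest text)")
conn.executemany(
    "insert into path (passengers, origin, dest) values (?, ?, ?)",
    [
        (100, "JFK", "LAX"),
        (80, "LAX", "JFK"),
        (70, "ORD", "JFK"),
        (50, "JFK", "ORD"),
        (30, "ORD", "LAX"),
        (10, "SFO", "SEA"),
    ],
)
# Hypothetical index name; the index lets the cutoff be read from the index.
conn.execute("create index path_passengers_idx on path (passengers)")


def top_routes_with_cutoff(conn, n_values, k):
    """Sum only rows whose passengers value is among the top n_values
    distinct values, then return the k largest route totals."""
    query = """
        select min(t.origin, t.dest) as od1, max(t.origin, t.dest) as od2,
               sum(t.passengers) as numpassengers
        from path t cross join
             (select min(passengers) as cutoff
              from (select distinct passengers
                    from path
                    order by passengers desc
                    limit ?
                   ) top_values
             ) minp
        where t.passengers >= minp.cutoff
        group by min(t.origin, t.dest), max(t.origin, t.dest)
        order by numpassengers desc
        limit ?
    """
    return conn.execute(query, (n_values, k)).fetchall()


approx = top_routes_with_cutoff(conn, 3, 3)
print(approx)
```

With a cutoff of 70, the JFK/ORD route is reported as 70 rather than its full 120, because the 50-passenger row is filtered out. The filtered query is therefore a heuristic: totals are lower bounds, and the ranking is only reliable if the cutoff is generous enough for the data.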

EDIT:

If you don't have least() and greatest(), use case expressions:

select (case when origin < dest then origin else dest end) as od1,
       (case when origin < dest then dest else origin end)  as od2,
       sum(passengers) as numpassengers
from path t
group by 1, 2
order by numpassengers desc
limit 3;

You can repeat the case expressions in the group by clause. However, Amazon Redshift allows you to reference column aliases or positions in the group by clause.
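The two ways of writing the group by can be compared directly. This is a minimal sketch on made-up data, assuming SQLite as a stand-in (it accepts both positional references and repeated expressions in group by); behaviour on other engines should be checked against their documentation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table path (passengers integer, origin text, dest text)")
conn.executemany(
    "insert into path (passengers, origin, dest) values (?, ?, ?)",
    [
        (100, "JFK", "LAX"),
        (80, "LAX", "JFK"),
        (50, "JFK", "ORD"),
        (70, "ORD", "JFK"),
        (30, "ORD", "LAX"),
    ],
)

select_part = """
    select (case when origin < dest then origin else dest end) as od1,
           (case when origin < dest then dest else origin end) as od2,
           sum(passengers) as numpassengers
    from path
"""

# Variant 1: group by column positions.
by_position = conn.execute(
    select_part + " group by 1, 2 order by numpassengers desc limit 3"
).fetchall()

# Variant 2: repeat the case expressions in the group by clause.
by_expression = conn.execute(
    select_part
    + """
    group by (case when origin < dest then origin else dest end),
             (case when origin < dest then dest else origin end)
    order by numpassengers desc
    limit 3
    """
).fetchall()

print(by_position)
print(by_expression)
```

Both variants return the same rows here; the positional form is shorter, while the repeated form does not depend on the order of the select list.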

Answer 1 (score: 0)

If every route is used in both directions, this should give the answer:

SELECT (x.passengers + y.passengers) AS passengers_sum, x.origin, x.dest
FROM yourTable x
JOIN yourTable y
ON x.origin = y.dest AND x.dest = y.origin
WHERE x.origin < x.dest  -- list each route once instead of once per direction
ORDER BY passengers_sum DESC;

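With an index on the origin and dest columns, the self-join should not worry you. I don't think an operation of that scale can be avoided if you want the requested result. If you only want the top X rows, you have to add some kind of LIMIT to the statement. I have no Postgres experience.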

Answer 2 (score: 0)

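I think SebastianH is right. As a small improvement, you could try the following, assuming your database supports a SELECT TOP clause (PostgreSQL itself does not; there, put LIMIT 3 at the end of the query instead):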

SELECT TOP 3 *
    FROM (SELECT SUM(A.PASSENGERS + B.PASSENGERS) AS PASSENGERS_SUM, A.ORIGIN, A.DEST
          FROM YOURTABLE A JOIN YOURTABLE B
            ON (A.ORIGIN = B.DEST AND A.DEST = B.ORIGIN)
          GROUP BY A.ORIGIN, A.DEST
         ) T
    ORDER BY PASSENGERS_SUM DESC;