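I have a table in Postgres with the following structure:
Table path: passengers, origin, destination, date, month, year
I want to find the top 3 routes by the number of passengers who travelled on a route within a year. The total number of passengers on a route (A <-> B) = total passengers (A -> B) + total passengers (B -> A).
What is the best/most efficient way to sum the passengers per route, given that the table has roughly 150 million rows?
Thanks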
Answer 0: (score: 4)
There are two ways to do this. One is an aggregation, the other is a join. The aggregation looks like this:
select least(origin, dest) as od1, greatest(origin, dest) as od2, sum(passengers) as numpassengers
from path t
group by least(origin, dest), greatest(origin, dest)
order by numpassengers desc
limit 3;
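As a quick sanity check of the aggregation approach, here is a minimal Python sketch that runs the same query shape against an in-memory SQLite table. The toy rows and the `top_routes` helper are made up for illustration, and SQLite's two-argument min()/max() are used as stand-ins for least()/greatest(); on Postgres you would keep least()/greatest().

```python
import sqlite3

# Hypothetical toy data: (passengers, origin, dest)
SAMPLE_ROWS = [
    (100, "A", "B"), (50, "B", "A"),
    (120, "A", "C"), (10, "C", "A"),
    (90, "B", "C"), (30, "D", "A"),
]

def top_routes(rows, limit=3):
    """Return the busiest undirected routes as (od1, od2, total) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table path (passengers integer, origin text, dest text)")
    conn.executemany("insert into path values (?, ?, ?)", rows)
    # Ordering the two endpoints makes A -> B and B -> A fall into the same group.
    query = """
        select min(origin, dest) as od1, max(origin, dest) as od2,
               sum(passengers) as numpassengers
        from path
        group by od1, od2
        order by numpassengers desc, od1, od2
        limit ?
    """
    result = conn.execute(query, (limit,)).fetchall()
    conn.close()
    return result

print(top_routes(SAMPLE_ROWS))
```

Note that a route flown in only one direction (D -> A in the toy data) is still counted here, which is not true of the join-based variants.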
The other is a self-join. If there is only one row per direction, you can do this without aggregating:
select p1.origin, p1.dest, p1.passengers + p2.passengers as numpassengers
from path p1 join
path p2
on p1.origin = p2.dest and p1.dest = p2.origin
where p1.origin < p1.dest
order by numpassengers desc
limit 3;
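Otherwise you need both a self-join and an aggregation, so the first method is probably faster. Aggregate each direction before joining, because joining the raw rows pairs every A -> B row with every B -> A row and inflates the sum:
with d as (
    select origin, dest, sum(passengers) as passengers
    from path
    group by origin, dest
)
select d1.origin, d1.dest, d1.passengers + d2.passengers as numpassengers
from d d1 join
     d d2
     on d1.origin = d2.dest and d1.dest = d2.origin
where d1.origin < d1.dest
order by numpassengers desc
limit 3;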
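One thing worth checking with the join-plus-aggregate variant is what happens when a direction has several rows. The sketch below is only an illustration on made-up rows in an in-memory SQLite table: it compares summing over a join of the raw rows with summing each direction first and joining afterwards.

```python
import sqlite3

# Hypothetical toy data: two rows for A -> B, one row for B -> A.
ROWS = [(10, "A", "B"), (20, "A", "B"), (5, "B", "A")]

conn = sqlite3.connect(":memory:")
conn.execute("create table path (passengers integer, origin text, dest text)")
conn.executemany("insert into path values (?, ?, ?)", ROWS)

# Joining raw rows pairs each A -> B row with each B -> A row,
# so the single B -> A row is counted twice: (10 + 5) + (20 + 5).
naive_total = conn.execute("""
    select sum(p1.passengers + p2.passengers)
    from path p1 join path p2
      on p1.origin = p2.dest and p1.dest = p2.origin
    where p1.origin < p1.dest
""").fetchone()[0]

# Summing each direction first leaves exactly one row per direction to join.
exact_total = conn.execute("""
    with d as (
        select origin, dest, sum(passengers) as passengers
        from path
        group by origin, dest
    )
    select d1.passengers + d2.passengers
    from d d1 join d d2
      on d1.origin = d2.dest and d1.dest = d2.origin
    where d1.origin < d1.dest
""").fetchone()[0]
conn.close()

print(naive_total, exact_total)
```

For these rows the raw join yields 40, while the true total for A <-> B is 35.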
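I don't know which is more efficient. However, I suspect that the totals for the top three routes will come from, say, the top 100 rows in each direction. If so, build an index on passengers and try: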
select least(origin, dest) as od1, greatest(origin, dest) as od2, sum(passengers) as numpassengers
from path t cross join
     (select min(passengers) as cutoff
      from (select distinct passengers
            from path
            order by passengers desc
            limit 100
           ) top100
     ) minp
where t.passengers >= minp.cutoff
group by least(origin, dest), greatest(origin, dest)
order by numpassengers desc
limit 3;
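Computing the cutoff should need only the index, and it should greatly reduce the load on the rest of the query. Keep in mind that the filter drops every row below the cutoff, so the sums cover only the remaining rows; the result matches the exact query only if the suspicion above holds for your data.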
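To see how the cutoff behaves, here is a small sketch on invented rows in an in-memory SQLite table. It is not a benchmark, only a correctness check: the `busiest` helper, the index name, and the "top 2 distinct values" cutoff (the answer uses 100) are assumptions for illustration, and SQLite's two-argument min()/max() stand in for least()/greatest().

```python
import sqlite3

# Hypothetical toy data: E <-> F is the busiest route overall but is built from small rows.
ROWS = [(100, "A", "B"), (90, "C", "D"), (40, "E", "F"), (40, "F", "E"), (40, "E", "F")]

def busiest(conn, min_passengers=0):
    """Top 3 undirected routes, using only rows with passengers >= min_passengers."""
    query = """
        select min(origin, dest) as od1, max(origin, dest) as od2,
               sum(passengers) as numpassengers
        from path
        where passengers >= ?
        group by od1, od2
        order by numpassengers desc, od1, od2
        limit 3
    """
    return conn.execute(query, (min_passengers,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("create table path (passengers integer, origin text, dest text)")
conn.executemany("insert into path values (?, ?, ?)", ROWS)
conn.execute("create index path_passengers on path (passengers)")

# Cutoff = smallest of the top 2 distinct per-row passenger values.
cutoff = conn.execute("""
    select min(passengers)
    from (select distinct passengers from path order by passengers desc limit 2)
""").fetchone()[0]

exact = busiest(conn)             # no filtering
filtered = busiest(conn, cutoff)  # rows below the cutoff are ignored
conn.close()

print(exact)
print(filtered)
```

On this data the filtered query misses E <-> F entirely, so it is worth comparing the two results on a sample of the real table before relying on the shortcut.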
Edit:
If you don't have least() and greatest(), use case expressions:
select (case when origin < dest then origin else dest end) as od1,
(case when origin < dest then dest else origin end) as od2,
sum(passengers) as numpassengers
from path t
group by 1, 2
order by numpassengers desc
limit 3;
You could repeat the case expressions in the group by. But Amazon Redshift lets you refer to column aliases or positions in the group by clause.
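The question asks for the top routes within a year, which none of the queries above filter on. A minimal sketch of the case-based version restricted to one year follows, run against an in-memory SQLite table. It assumes the year column holds an integer; the toy rows and the `top_routes_for_year` helper are hypothetical.

```python
import sqlite3

# Hypothetical toy data: (passengers, origin, dest, year)
SAMPLE_ROWS = [
    (100, "A", "B", 2015), (50, "B", "A", 2015),
    (500, "A", "C", 2014), (70, "C", "A", 2015),
    (20, "B", "C", 2015), (5, "D", "A", 2015),
]

def top_routes_for_year(rows, year, limit=3):
    """Return the busiest undirected routes of one year as (od1, od2, total)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table path (passengers integer, origin text, dest text, year integer)"
    )
    conn.executemany("insert into path values (?, ?, ?, ?)", rows)
    # The case expressions order the endpoints; group by refers to their aliases.
    query = """
        select (case when origin < dest then origin else dest end) as od1,
               (case when origin < dest then dest else origin end) as od2,
               sum(passengers) as numpassengers
        from path
        where year = ?
        group by od1, od2
        order by numpassengers desc, od1, od2
        limit ?
    """
    result = conn.execute(query, (year, limit)).fetchall()
    conn.close()
    return result

print(top_routes_for_year(SAMPLE_ROWS, 2015))
```

Filtering on the year before grouping also shrinks the number of rows that have to be aggregated.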
Answer 1: (score: 0)
If every route is used in both directions, this should give the answer:
SELECT (x.passengers + y.passengers) AS passengers_sum, x.origin, x.dest
FROM yourTable x
JOIN yourTable y
ON x.origin = y.dest AND x.dest = y.origin
WHERE x.origin < x.dest -- list each route once instead of once per direction
ORDER BY passengers_sum DESC;
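With indexes on the origin and dest columns, the self-join shouldn't worry you. I don't think an operation of that scale can be avoided if you want the requested result.
If you only want the top X rows, you have to add some kind of LIMIT to the statement. I have no Postgres experience.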
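Answer 2: (score: 0)
I think SebastianH is right. As a small improvement, you could try the following. PostgreSQL has no SELECT TOP clause, so this version uses ORDER BY with LIMIT:
SELECT ORIGIN, DEST, PASSENGERS_SUM
FROM (SELECT A.ORIGIN, A.DEST, SUM(A.PASSENGERS + B.PASSENGERS) AS PASSENGERS_SUM
      FROM YOURTABLE A JOIN YOURTABLE B
        ON (A.ORIGIN = B.DEST AND A.DEST = B.ORIGIN)
      WHERE A.ORIGIN < A.DEST
      GROUP BY A.ORIGIN, A.DEST
     ) T
ORDER BY PASSENGERS_SUM DESC
LIMIT 3;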