Performant aggregate in a WHERE clause

Asked: 2011-03-08 22:17:02

Tags: sql postgresql

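I'm trying to count the users created in the three months leading up to the most recently created user, with everything grouped by state.

Here is a query that works: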

select count(u.id) as numberOfUsers,
s.state
from users u
join states s on u.state_id = s.id
where u.creationdate > (
select max(u2.creationdate)
from users u2
where u2.state_id = s.id
) - interval '3 months'
group by s.state

However, it takes 100 seconds. Can anyone give me a more efficient one?

I was hoping something like this would work:

select count(u.id) as numberOfUsers,
s.state, max(u2.creationdate) as lastCreated
from users u
join states s on u.state_id = s.id
where u.creationdate > lastCreated - interval '3 months'
group by s.state

4 answers:

Answer 0 (score: 3):

This may perform better, since it makes only a single scan:

select count(*) as numberofusers,
       state
from ( select id, state_id, creationdate,
              max(creationdate) over (partition by state_id) - '3 months'::interval as cutoff
       from users
     ) x
     join states on states.id = x.state_id
where creationdate > cutoff
group by state

However, the initial window aggregate will chew through a lot of working memory.

Hmm, maybe something more like:

with cutoffs as (
  select id, state,
         (select max(creationdate)
          from users
          where users.state_id = states.id) - '3 months'::interval as cutoff
  from states)
select count(*) as numberofusers, state
from users
     join cutoffs on users.state_id = cutoffs.id
where users.creationdate > cutoff
group by state

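This tries to coax PostgreSQL into doing a proper partitioned scan, but it isn't ideal. It still does a full table scan, but at least only one. A set-returning function that iterates over the CTE output and issues the outer query inside the loop would probably work best, since it could use an index on creationdate for each state.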
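The set-returning function idea could be sketched roughly as follows. This is only an illustration, not a tested solution: the function name recent_users_by_state, the output column names, and the cast of state to text are assumptions, and it presumes an index on users (state_id, creationdate) exists so that each per-state query can use it.

```sql
-- Hypothetical PL/pgSQL sketch: loop over the per-state cutoffs and
-- run one small, index-friendly count query per state.
create or replace function recent_users_by_state()
returns table (numberofusers bigint, state_name text)
language plpgsql stable as $$
declare
  s record;
begin
  for s in
    select st.id, st.state,
           (select max(u.creationdate)
            from users u
            where u.state_id = st.id) - interval '3 months' as cutoff
    from states st
  loop
    -- a state with no users has a NULL cutoff; skip it, as the join-based queries do
    continue when s.cutoff is null;

    return query
      select count(*), s.state::text
      from users u
      where u.state_id = s.id
        and u.creationdate > s.cutoff;
  end loop;
end;
$$;

-- usage
select * from recent_users_by_state();
```

Whether this beats the single-scan versions depends on how many states there are and how selective the three-month window is, so it is worth comparing the plans.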

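Answer 1 (score: 2):

Out of interest, how does the following query perform? I'm particularly interested in how PostgreSQL handles the innermost query (the states table plus a scalar subquery).

The users table needs a composite index on (state_id, creationdate) for this to pay off.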
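For reference, such an index could be created like this. The index name is taken from the plan shown further down in this answer; the rest is a plain sketch that assumes the users table from the question.

```sql
-- composite index: equality on state_id first, then the range column creationdate
create index users_state_id_creationdate_idx
    on users (state_id, creationdate);

-- refresh planner statistics after building the index
analyze users;
```

Putting state_id first lets one index serve both the per-state max(creationdate) lookup and the per-state range count.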

select s2.id
      ,s2.state
      ,(select count(*) 
          from users u 
         where u.state_id     = s2.id
           and u.creationdate > s2.max_date) as numberOfUsers
  from (select s.id
              ,s.state
              ,(select max(u.creationdate) - interval '3 months'
                  from users u
                 where u.state_id = s.id) as max_date
         from states s
       ) s2;

Edit: here is the plan generated for that query, with 100,000 user rows across 3 states:

 Seq Scan on states s (actual time=4.033..13.949 rows=3 loops=1)
   Buffers: shared hit=1743
   SubPlan 3
     ->  Aggregate (actual time=4.636..4.636 rows=1 loops=3)
           Buffers: shared hit=1742
           InitPlan 2 (returns $2)
             ->  Result (actual time=0.028..0.028 rows=1 loops=3)
                   Buffers: shared hit=12
                   InitPlan 1 (returns $1)
                     ->  Limit (actual time=0.022..0.022 rows=1 loops=3)
                           Buffers: shared hit=12
                           ->  Index Scan Backward using users_state_id_creationdate_idx on users u (actual time=0.019..0.019 rows=1 loops=3)
                                 Index Cond: ((state_id = $0) AND (creationdate IS NOT NULL))
                                 Buffers: shared hit=12
           ->  Bitmap Heap Scan on users u (actual time=1.095..3.693 rows=8425 loops=3)
                 Recheck Cond: ((state_id = $0) AND (creationdate > $2))
                 Buffers: shared hit=1730
                 ->  Bitmap Index Scan on users_state_id_creationdate_idx (actual time=1.017..1.017 rows=8425 loops=3)
                       Index Cond: ((state_id = $0) AND (creationdate > $2))
                       Buffers: shared hit=107
 Total runtime: 14.017 ms
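A plan in this shape (actual times and buffer counts, no cost estimates) can probably be reproduced with something like the statement below. The exact options used for the plan above are not stated, so treat them as an assumption; the parenthesized EXPLAIN option syntax requires PostgreSQL 9.0 or later.

```sql
-- assumed options: ANALYZE runs the query, BUFFERS adds the "Buffers:" lines,
-- COSTS OFF hides the planner's cost estimates
explain (analyze, buffers, costs off)
select s2.id
      ,s2.state
      ,(select count(*)
          from users u
         where u.state_id     = s2.id
           and u.creationdate > s2.max_date) as numberOfUsers
  from (select s.id
              ,s.state
              ,(select max(u.creationdate) - interval '3 months'
                  from users u
                 where u.state_id = s.id) as max_date
          from states s
       ) s2;
```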

Answer 2 (score: 1):

Here is the query I used to bring the time down to 82 ms:

with cutoffs as (
  -- one row per state: newest creation date minus three months
  select max(u.creationdate) - interval '3 months' as cutoff, s.id, s.state
  from users u
       join states s on u.state_id = s.id
  group by s.state, s.id)
select count(*) as numberofusers, state
from users
     join cutoffs on users.state_id = cutoffs.id
where users.creationdate > cutoff
group by state

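Thanks to araqnid.

Answer 3 (score: 0):

Are you sure which part of the query is slow? Can you add indexes? I'm no Postgres guru, but I suspect that if users isn't indexed on users.creationdate, the MAX() function will have to do a full table scan. Hmm, it may have to do one anyway...

That said, here goes nothing!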

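SELECT u.numUsers, s.state FROM
(SELECT count(id) AS numUsers, state_id
 FROM users
 -- an aggregate cannot appear directly in WHERE, so MAX() goes in a scalar subquery
 WHERE creationdate > (SELECT MAX(u2.creationdate)
                       FROM users u2
                       WHERE u2.state_id = users.state_id) - interval '3 months'
 GROUP BY state_id) u
LEFT JOIN states s ON u.state_id = s.id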