Calculate an aggregate over a subset of the data based on a condition

Asked: 2021-03-27 20:24:19

Tags: sql postgresql

I have a database table as follows:

| company | timestamp  | value |
| ------- | ---------- | ----- |
| google  | 2020-09-01 | 5     |
| google  | 2020-08-01 | 4     |
| amazon  | 2020-09-02 | 3     |

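I want to compute each company's average value over the past year if it has >= 20 data points in that period. If it has fewer than 20 data points, I want the average over the entire time span instead. I know I can run two separate queries and get the average for each scenario. My question is how to merge the results back into a single table according to this criterion.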

select company, avg(value) from my_db GROUP BY company;

select company, avg(value) from my_db
where timestamp > (CURRENT_DATE - INTERVAL '12 months')
GROUP BY company;

3 answers:

Answer 0 (score: 1)

Use conditional aggregation:

select company, 
       case 
         when sum(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end) >= 20 then 
              avg(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end)
         else avg(value)
       end
from my_db 
group by company

If by 20 data points you mean 20 rows for each company in the past 12 months, then use count instead of sum:

select company, 
       case 
         when count(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end) >= 20 then 
              avg(case when timestamp > CURRENT_DATE - INTERVAL '12 months' then value end)
         else avg(value)
       end
from my_db 
group by company 
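
In PostgreSQL 9.4 or later, the same conditional aggregation can also be written with the aggregate FILTER clause instead of CASE expressions. This is a minimal sketch, assuming the question's my_db table and the row-count reading of "20 data points"; the alias avg_value is made up for illustration and is not part of the original answer.

```sql
-- Sketch: count-based variant rewritten with FILTER (PostgreSQL 9.4+).
-- Note: count(*) counts every row in the 12-month window, even when value
-- is NULL; use count(value) to match the CASE form above exactly.
-- avg_value is an illustrative alias.
select company,
       case
         when count(*) filter (where timestamp > CURRENT_DATE - INTERVAL '12 months') >= 20 then
              avg(value) filter (where timestamp > CURRENT_DATE - INTERVAL '12 months')
         else avg(value)
       end as avg_value
from my_db
group by company;
```

Both forms scan the table once; FILTER mainly keeps the condition separate from the aggregated expression.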

Answer 1 (score: 1)

You can use a window function to provide the filtering information:

select company, avg(value),
       (count(*) = cnt_this_year) as only_this_year
from (select t.*,
             count(*) filter (where date_trunc('year', datecol) = date_trunc('year', now())) over (partition by company) as cnt_this_year
      from t
     ) t
where (cnt_this_year >= 20 and date_trunc('year', datecol) = date_trunc('year', now())) or
      cnt_this_year < 20
-- cnt_this_year is constant per company, so grouping by it does not split groups
group by company, cnt_this_year;

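The third column indicates whether all of the rows used come from the current year. Because the filtering happens in the where clause, it is also easy to add other calculations (such as min(), max(), and so on).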
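
As an illustration of adding other calculations, here is a sketch of the same filtered subquery with min() and max() next to avg(). It reuses the answer's placeholder names t and datecol; the aliases avg_value, min_value, and max_value are assumptions added for readability.

```sql
-- Sketch: the rows are already restricted in the where clause, so any
-- additional aggregate sees the same subset as avg(value).
-- t and datecol are placeholders; the output aliases are illustrative.
select company,
       avg(value) as avg_value,
       min(value) as min_value,
       max(value) as max_value
from (select t.*,
             count(*) filter (where date_trunc('year', datecol) = date_trunc('year', now())) over (partition by company) as cnt_this_year
      from t
     ) t
where (cnt_this_year >= 20 and date_trunc('year', datecol) = date_trunc('year', now())) or
      cnt_this_year < 20
group by company;
```

Note that this query tests the current calendar year via date_trunc rather than a rolling 12-month window, so the condition would need adjusting to match the question exactly.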

Answer 2 (score: 1)

WITH last_year AS (
   SELECT company, avg(value), 'year' AS range  -- optional tag
   FROM   tbl      
   WHERE  timestamp >= now() - interval '1 year'
   GROUP  BY 1
   HAVING count(*) >= 20  -- 20+ rows in range
   )
SELECT company, avg(value), 'all' AS range
FROM   tbl t
WHERE  NOT EXISTS (SELECT FROM last_year WHERE company = t.company)
GROUP  BY 1
UNION ALL TABLE last_year;


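An index on (timestamp) will only be used if your table is big and holds many years of data.

If most companies have 20 or more rows in the range, an index on (company) will be used for the second SELECT to retrieve the few outliers.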
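
The two indexes mentioned above could be created roughly as follows. This is a sketch only: the index names are hypothetical, tbl is the answer's placeholder table, and whether the planner actually uses either index depends on table size and data distribution, so it is worth checking with EXPLAIN.

```sql
-- Sketch: plain B-tree indexes on the columns discussed above.
-- Index names are hypothetical.
CREATE INDEX tbl_timestamp_idx ON tbl (timestamp);
CREATE INDEX tbl_company_idx   ON tbl (company);

-- Inspect the plan for the range query from the CTE to see whether
-- the (timestamp) index is chosen.
EXPLAIN
SELECT company, avg(value)
FROM   tbl
WHERE  timestamp >= now() - interval '1 year'
GROUP  BY 1
HAVING count(*) >= 20;
```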