Question

I have a table with this schema:

create table mytable (creation_date timestamp,
                      value int,
                      category int);

I want the value that occurs most often per hour for each category, on weekdays only. I have made some progress and now have a query like this:

select category,foo.h as h,value, count(value) from mytable, (
      select date_trunc('hour', 
                        '2000-01-01 00:00:00'::timestamp+generate_series(0,23)*'1 hour'::interval)::time as h) AS foo 
      where date_part('hour',creation_date) = date_part('hour',foo.h) and
            date_part('dow',creation_date) > 0 and date_part('dow',creation_date) < 6
      group by category,h,value;

As a result I get something like this:

category |    h     |  value  | count 
---------+----------+---------+-------
       1 | 00:00:00 |       2 |     1
       1 | 01:00:00 |       2 |     1
       1 | 02:00:00 |       2 |     6
       1 | 03:00:00 |       2 |    31
       1 | 03:00:00 |       3 |    11
       1 | 04:00:00 |       2 |    21
       1 | 04:00:00 |       3 |     9
       1 | 13:00:00 |       1 |    14
       1 | 14:00:00 |       1 |    10
       1 | 14:00:00 |       2 |     7
       1 | 15:00:00 |       1 |    52

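For example, at 04:00 I get the values 2 and 3, with counts of 21 and 9 respectively. I only need the value with the highest count, which would be the statistical mode.

BTW, I have more than 2M records.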

Answer 1

This can be simpler:

SELECT DISTINCT ON (category, extract(hour FROM creation_date)::int)
       category
     , extract(hour FROM creation_date)::int AS h
     , count(*)::int AS max_ct
     , value
FROM   mytable
WHERE  extract(isodow FROM creation_date) < 6 -- no sat or sun
GROUP  BY 1,2,4
ORDER  BY 1,2,3 DESC;

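Basically, these are the steps:

Exclude weekends (WHERE ...). Use ISODOW to simplify the expression.
Extract the hour from the timestamp as h.
Group by category, h and value.
Count the rows per combination of the three; cast to integer, since we don't need bigint.
Order by category, h and the highest count (DESC).
Pick only the first row (highest count) per (category, h).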

I was able to do this in a single query level because DISTINCT is applied after the aggregate function.
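
To illustrate the point about evaluation order, the same result can be sketched with two query levels: aggregate in a subquery first, then apply DISTINCT ON outside. This is only an equivalent spelling for comparison, not something the answer proposes; the alias ct and the subquery name sub are made up here. The trailing value key in ORDER BY is also an addition, so that ties on the count resolve to the smallest value rather than an arbitrary one.

```sql
-- Two-level equivalent of the single-level query (illustrative sketch).
SELECT DISTINCT ON (category, h)
       category, h, ct AS max_ct, value
FROM  (
   -- Inner level: one row per (category, hour, value) with its count.
   SELECT category
        , extract(hour FROM creation_date)::int AS h
        , value
        , count(*)::int AS ct
   FROM   mytable
   WHERE  extract(isodow FROM creation_date) < 6  -- Mon..Fri only
   GROUP  BY 1, 2, 3
   ) sub
-- Outer level: keep the first row per (category, h), i.e. the highest count.
-- "value" as last key is an added tiebreaker, not part of the original answer.
ORDER  BY category, h, ct DESC, value;
```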

The result contains no row for any (category, h) that has no entries at all. If you need to fill in the gaps, LEFT JOIN to this:

SELECT c.category, h.h FROM cat_tbl c CROSS JOIN (SELECT generate_series(0, 23) AS h) h
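
A sketch of how that grid might be combined with the main query. It assumes, as the fragment above does, that cat_tbl holds one row per category in a column named category; the answer does not show that table's definition, and the aliases g and m are made up here.

```sql
-- Every (category, hour) pair, with NULLs where mytable has no weekday rows.
SELECT g.category, g.h, m.value, m.max_ct
FROM  (
   -- Full grid: each category crossed with hours 0..23.
   SELECT c.category, h.h
   FROM   cat_tbl c
   CROSS  JOIN (SELECT generate_series(0, 23) AS h) h
   ) g
LEFT   JOIN (
   -- The mode query from above, unchanged.
   SELECT DISTINCT ON (category, extract(hour FROM creation_date)::int)
          category
        , extract(hour FROM creation_date)::int AS h
        , count(*)::int AS max_ct
        , value
   FROM   mytable
   WHERE  extract(isodow FROM creation_date) < 6
   GROUP  BY 1,2,4
   ORDER  BY 1,2,3 DESC
   ) m USING (category, h)
ORDER  BY g.category, g.h;
```

Pairs without data come back with NULL in value and max_ct.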

Answer 2

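Given the size of your table, I'd be tempted to use your query to build a temporary table, then run a query against that to finalize the results.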
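
The answer doesn't show the statement that builds the temporary table. A minimal sketch, assuming a simplified form of the question's query is acceptable: it drops the join against generate_series, which grouping by hour doesn't need, and stores h as an integer hour rather than a time value. The column names match the ones the queries below expect.

```sql
-- Build the per-(category, hour, value) counts once (illustrative sketch).
CREATE TEMP TABLE summary_table AS
SELECT category
     , date_part('hour', creation_date)::int AS h
     , value
     , count(value) AS count
FROM   mytable
WHERE  date_part('dow', creation_date) BETWEEN 1 AND 5  -- Mon..Fri
GROUP  BY 1, 2, 3;

-- Refresh planner statistics for the new table before querying it.
ANALYZE summary_table;
```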

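Assuming you called the temporary table "summary_table", the following query should do it. It keeps a row only if no other row for the same category and hour has a higher count, breaking ties in favor of the smaller value.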

select
    category, h, value, count
from
    summary_table s1
where
    not exists 
    (select * from summary_table s2
        where s1.category = s2.category and
              s1.h = s2.h and
              (s1.count < s2.count
               OR (s1.count = s2.count and s1.value > s2.value)));

If you don't want to create a table, you can use a WITH clause to attach your query to this one:

with summary_table as (
select category,foo.h as h,value, count(value) as count from mytable, (
      select date_trunc('hour', 
                        '2000-01-01 00:00:00'::timestamp+generate_series(0,23)*'1 hour'::interval)::time as h) AS foo 
      where date_part('hour',creation_date) = date_part('hour',foo.h) and
            date_part('dow',creation_date) > 0 and date_part('dow',creation_date) < 6
      group by category,h,value)
select
    category, h, value, count
from
    summary_table s1
where
    not exists 
    (select * from summary_table s2
        where s1.category = s2.category and
              s1.h = s2.h and
              (s1.count < s2.count
               OR (s1.count = s2.count and s1.value > s2.value)));

Statistical mode in Postgres
