I'm facing a query design problem and am not sure whether I'm overcomplicating my approach:
I have a fact table:
   Column   |            Type             |                       Modifiers
------------+-----------------------------+-------------------------------------------------------
 id         | integer                     | not null default nextval('messages_id_seq'::regclass)
 type       | character varying(255)      |
 ts         | numeric                     |
 text       | text                        |
 score      | double precision            |
 user_id    | integer                     |
 channel_id | integer                     |
 time_id    | integer                     |
 created_at | timestamp without time zone |
 updated_at | timestamp without time zone |
I'm running some analytical queries against it; one of them, for example, is:
with intervals as (
    select
        (select '09/27/2014'::date) + (n || ' minutes')::interval start_time,
        (select '09/27/2014'::date) + ((n+60) || ' minutes')::interval end_time
    from generate_series(0, (24*60*7), 60 * 4) n
)
select
    extract(epoch from i.start_time)::numeric * 1000 as ts,
    extract(epoch from i.end_time)::numeric * 1000 as end_ts,
    sum(avg(messages.score)) over (order by i.start_time) as score
from messages
right join intervals i
    on messages.timestamp >= i.start_time and messages.timestamp < i.end_time
where messages.timestamp between '09/27/2014' and '10/04/2014'
group by i.start_time, i.end_time
order by i.start_time
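As you can probably tell, this query computes the average of the messages' "score" attribute for each of the given time buckets, then computes a cumulative sum of those averages across the buckets (using a window).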
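The nesting `sum(avg(...)) over (...)` is accepted because the aggregate is evaluated per group first, and the window function then runs over the grouped rows. A minimal sketch on made-up data (the `samples` relation and its `bucket` and `score` values are invented for illustration and are not part of the schema above):

```sql
-- Toy data: three hypothetical buckets with a few scores each.
with samples(bucket, score) as (
    values (1, 2.0), (1, 4.0),
           (2, 1.0), (2, 3.0),
           (3, 6.0)
)
select
    bucket,
    avg(score) as bucket_avg,                               -- aggregate: one average per bucket
    sum(avg(score)) over (order by bucket) as running_total -- window: running sum of those averages
from samples
group by bucket
order by bucket;
```

With this toy data the bucket averages are 3, 2 and 6, so the running totals come out as 3, 5 and 11.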
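What I would like to do next is find the top 5 (for example) messages.text values that are closest to the average of each bucket.
Right now, the only plan I have is: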
1) Join messages with the time-buckets
2) Compute a score - avg(score) over (partition by start_time) as deviation and save it against each record of the joined relation
3) Compute a rank() over (order by deviation) as rank
4) Select where rank between 1 and 5
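The reason I'm writing this down is that this is the first time I've tried to design something that uses a window function inside a window function (rank() over (partition by start_time order by score - avg(score) over (partition by start_time))), and I haven't even checked whether it works.
Can I get some advice on whether I'm headed in the right direction?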
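Answer 0 (score: 0)
The direction I would suggest (this is just my opinion):
1) Subtract avg(score) from each row's score. This leaves you with both positive and negative values.
2) Apply abs() to each result from step 1.
3) Compute rank() over those values, ordered ascending, and filter with WHERE rank BETWEEN 1 AND 5.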
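These steps can be sketched on toy data as follows. This is only an illustration under assumptions: the `bucketed` relation and its `bucket`, `msg_text` and `score` columns are hypothetical stand-ins for messages that have already been joined to their time bucket. As far as I know, PostgreSQL does not accept a window function call inside another window's PARTITION BY or ORDER BY, so the deviation is computed in one CTE and ranked in the next.

```sql
-- Toy data: hypothetical messages already assigned to a bucket.
with bucketed(bucket, msg_text, score) as (
    values (1, 'a', 1.0), (1, 'b', 2.0), (1, 'c', 6.0),
           (2, 'd', 5.0), (2, 'e', 7.0)
), deviations as (
    -- Steps 1 and 2: absolute distance of each score from its bucket average.
    select bucket, msg_text, score,
           abs(score - avg(score) over (partition by bucket)) as deviation
    from bucketed
), ranked as (
    -- Step 3: rank within each bucket, closest to the average first.
    select *, rank() over (partition by bucket order by deviation) as rnk
    from deviations
)
select bucket, msg_text, score, deviation
from ranked
where rnk between 1 and 5
order by bucket, rnk;
```

Note that rank() gives tied rows the same rank, so a bucket can return more than five rows when deviations tie; row_number() caps the result at exactly five.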
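Answer 1 (score: 0)
Whelp - here is what I have, and it seems to work:
Now critique away at the structure, performance and redundancy in my query! ^_^ (Apart from generating the time series directly instead of all the convoluted interval math, which will get fixed eventually!)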
with intervals as (
    select
        (select '09/29/2014'::date) + (n || ' minutes')::interval start_time,
        (select '09/29/2014'::date) + ((n+60) || ' minutes')::interval end_time
    from generate_series(0, (24*60*7), 60 * 4) n
), intervaled_messages as (
    select
        extract(epoch from i.start_time)::numeric * 1000 as ts,
        extract(epoch from i.end_time)::numeric * 1000 as end_ts,
        abs(score - avg(score) over (partition by i.start_time)) as deviation
    from messages
    right join intervals i
        on messages.timestamp >= i.start_time and messages.timestamp < i.end_time
    where messages.timestamp between '09/29/2014' and '10/06/2014'
), ranked_messages as (
    select ts, end_ts, deviation,
        rank() over (partition by ts order by deviation) as rank,
        row_number() over (partition by ts order by deviation) as row_number
    from intervaled_messages
)
select ts, end_ts, deviation, rank
from ranked_messages
where rank between 1 and 5
    and row_number between 1 and 5
order by ts;
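One possible tightening, offered as a sketch rather than a definitive rewrite. It rests on three observations: the WHERE on messages.timestamp discards the NULL-extended rows, so the right join behaves like an inner join; row_number <= 5 already implies rank <= 5, so one of the two filters is redundant; and the question asked for the message text, which the query above does not return. The sketch assumes messages.timestamp is a timestamp column, as the queries above imply (the table listing does not show it), and the aliases `message_text` and `rn` are names introduced here.

```sql
with intervals as (
    -- Same buckets as above: 60-minute windows starting every 4 hours for 7 days.
    select start_time, start_time + interval '60 minutes' as end_time
    from generate_series(
        timestamp '2014-09-29',
        timestamp '2014-09-29' + interval '7 days',
        interval '4 hours'
    ) as start_time
), intervaled_messages as (
    select
        i.start_time,
        i.end_time,
        m.text as message_text,
        abs(m.score - avg(m.score) over (partition by i.start_time)) as deviation
    from intervals i
    join messages m
      on m.timestamp >= i.start_time and m.timestamp < i.end_time
), ranked_messages as (
    -- row_number() alone caps each bucket at five rows, ties included.
    select *,
           row_number() over (partition by start_time order by deviation) as rn
    from intervaled_messages
)
select
    extract(epoch from start_time)::numeric * 1000 as ts,
    extract(epoch from end_time)::numeric * 1000 as end_ts,
    message_text,
    deviation
from ranked_messages
where rn <= 5
order by start_time, rn;
```

One behavioural difference to be aware of: without the original WHERE clause, the final bucket (the one starting exactly 7 days after the start date) is included in full rather than only at its first instant. Buckets with no messages are omitted, as in the original.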