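I'm working on a fairly complex SQL statement and can't work out how to get the prop_list with the highest count when aggregating across users. Here is a sample of my dataset: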
user_id, term_id, time_stamp, prop_list
u100, t10, 7:00, (a,b,c)
u100, t10, 7:01, (a,b)
u100, t11, 7:01, (a,b)
u101, t10, 7:00, (a,b,c)
u101, t10, 7:01, (a)
u102, t10, 6:59, (a)
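To reproduce the problem, the sample can be loaded with something like the following. This is only a sketch: the column types are assumptions (the post does not give the schema), prop_list is stored as plain text, and the times are zero-padded so that text ordering matches chronological ordering.

```sql
-- Hypothetical schema: all column types are assumptions, not taken from the post.
create table data_table (
    user_id    varchar(10),
    term_id    varchar(10),
    time_stamp varchar(5),   -- zero-padded HH:MM so text order equals time order
    prop_list  varchar(50)   -- list kept as plain text, e.g. '(a,b,c)'
);

-- Multi-row VALUES works in PostgreSQL, MySQL, SQL Server and SQLite;
-- some older databases need one INSERT statement per row instead.
insert into data_table (user_id, term_id, time_stamp, prop_list) values
    ('u100', 't10', '07:00', '(a,b,c)'),
    ('u100', 't10', '07:01', '(a,b)'),
    ('u100', 't11', '07:01', '(a,b)'),
    ('u101', 't10', '07:00', '(a,b,c)'),
    ('u101', 't10', '07:01', '(a)'),
    ('u102', 't10', '06:59', '(a)');
```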
Expected output:
term_id, term_id_distinct_count, prop_list
t10, 3, (a,b,c)
t11, 1, (a,b)
This is my current code:
select
a.term_id,
count(distinct user_id) as term_id_distinct_count,
a.prop_list
from
(select
user_id, term_id,
prop_list,
row_number() over(partition by user_id, term_id order by time_stamp asc) as row_no
from
data_table
group ) a
where
a.row_no = 1;
Note that when a user_id has multiple rows for the same term_id, we only want to use the one that occurred first, which is why I order by time_stamp asc.
Answer 0 (score: 0)
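Some databases that support window functions also support count(distinct)
as a window function (Oracle does, for example; PostgreSQL and SQL Server do not), so where it is available you can do: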
select a.term_id, term_id_distinct_count, a.prop_list
from (select user_id, term_id, prop_list,
row_number() over (partition by term_id order by time_stamp asc) as seqnum,
count(distinct user_id) over (partition by term_id) as term_id_distinct_count
from data_table
) a
where seqnum = 1;
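Note that the query above numbers rows per term_id only, so for t10 it would pick the earliest row overall (u102's (a) at 6:59) rather than (a,b,c). A possible alternative is sketched below; it is not a definitive solution. It assumes that "the prop_list with the highest count" means the prop_list that the most users have on their first row for a term, and that time_stamp sorts chronologically. It avoids count(distinct) as a window function by first reducing the data to one row per user and term. The CTE names first_rows, prop_counts and ranked are made up for illustration.

```sql
with first_rows as (
    -- Keep only the earliest row for each (user_id, term_id) pair.
    select user_id, term_id, prop_list
    from (select user_id, term_id, prop_list,
                 row_number() over (partition by user_id, term_id
                                    order by time_stamp asc) as row_no
          from data_table) t
    where row_no = 1
),
prop_counts as (
    -- Count how many users start with each prop_list, per term.
    select term_id, prop_list, count(*) as user_count
    from first_rows
    group by term_id, prop_list
),
ranked as (
    -- first_rows has one row per user and term, so summing user_count
    -- over a term gives the number of distinct users for that term.
    select term_id, prop_list,
           sum(user_count) over (partition by term_id) as term_id_distinct_count,
           row_number() over (partition by term_id
                              order by user_count desc, prop_list) as seqnum
    from prop_counts
)
select term_id, term_id_distinct_count, prop_list
from ranked
where seqnum = 1
order by term_id;
```

On the sample data this returns t10, 3, (a,b,c) and t11, 1, (a,b). Ties between equally common prop_list values are broken alphabetically, which is an arbitrary choice.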