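I want to count the number of distinct port numbers that occur between the current row and the X preceding rows (a sliding window), where X can be any integer.
For example, if the input is: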
ID PORT
1 21
2 22
3 23
4 25
5 25
6 21
the output should be:
ID PORT COUNT
1 21 1
2 22 2
3 23 3
4 25 4
5 25 4
6 21 4
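To pin down the intended semantics, here is a minimal reference sketch in Python (not Hive). The function name `sliding_distinct_counts` and the plain list of ports are assumptions made for illustration; the window is taken to be the current row plus the X rows before it.

```python
from collections import Counter, deque

def sliding_distinct_counts(ports, x):
    """For each row, count distinct ports among the current row and the x preceding rows."""
    window = deque()   # ports currently inside the window, oldest first
    seen = Counter()   # port -> number of occurrences inside the window
    counts = []
    for port in ports:
        window.append(port)
        seen[port] += 1
        if len(window) > x + 1:      # keep at most x preceding rows plus the current one
            old = window.popleft()
            seen[old] -= 1
            if seen[old] == 0:
                del seen[old]        # the port has left the window entirely
        counts.append(len(seen))
    return counts

ports = [21, 22, 23, 25, 25, 21]
print(sliding_distinct_counts(ports, 5))   # [1, 2, 3, 4, 4, 4]
```

With X = 5 this reproduces the COUNT column above; a smaller X such as 1 gives [1, 2, 2, 2, 1, 2] for the same input.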
I am using Hive, not RapidMiner, and I tried the following:
select id, port,
count (*) over (partition by srcport order by id rows between 5 preceding and current row)
This has to work on big data, including when X is a large integer.
Any feedback would be appreciated.
Answer 0 (score: 0)
I don't think there is a simple way to do this. One approach uses lag():
select id, port,
       ( (case when port_5 <> -1 then 1 else 0 end) +
         (case when port_4 <> -1 and port_4 not in (port_5) then 1 else 0 end) +
         (case when port_3 <> -1 and port_3 not in (port_5, port_4) then 1 else 0 end) +
         (case when port_2 <> -1 and port_2 not in (port_5, port_4, port_3) then 1 else 0 end) +
         (case when port_1 <> -1 and port_1 not in (port_5, port_4, port_3, port_2) then 1 else 0 end) +
         (case when port is not null and port not in (port_5, port_4, port_3, port_2, port_1) then 1 else 0 end)
       ) as cumulative_distinct_count
from (select t.*,
             -- the default -1 marks "no such row"; a NULL in a NOT IN list would make the test NULL
             -- (assumes port is never NULL and never -1)
             lag(port, 5, -1) over (partition by srcport order by id) as port_5,
             lag(port, 4, -1) over (partition by srcport order by id) as port_4,
             lag(port, 3, -1) over (partition by srcport order by id) as port_3,
             lag(port, 2, -1) over (partition by srcport order by id) as port_2,
             lag(port, 1, -1) over (partition by srcport order by id) as port_1
      from t
     ) t
It is a complicated query, but performance should be fine.
Note: I assume port and srcport are the same thing, but this borrows from your query.
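Writing this pattern out by hand stops being practical as X grows, so one option is to generate the query text. The sketch below is only an illustration: `build_lag_query` is a hypothetical helper, it omits the partition by clause because the sample data has no srcport column, it uses -1 as a "no such row" default for lag() on the assumption that ports are never NULL or negative, and it runs the result on SQLite rather than Hive.

```python
import sqlite3

def build_lag_query(x, table="t"):
    """Build a lag()-based distinct-count query over the current row and x preceding rows."""
    if x < 1:
        raise ValueError("x must be at least 1")
    # port_k is the port k rows back; -1 is returned when that row does not exist.
    lags = ",\n".join(
        f"lag(port, {k}, -1) over (order by id) as port_{k}" for k in range(x, 0, -1)
    )
    terms = []
    for k in range(x, -1, -1):   # oldest row first, current row (k = 0) last
        col = "port" if k == 0 else f"port_{k}"
        older = ", ".join(f"port_{j}" for j in range(x, k, -1))
        cond = f"{col} <> -1"
        if older:
            cond += f" and {col} not in ({older})"   # count only the first occurrence
        terms.append(f"(case when {cond} then 1 else 0 end)")
    total = " +\n".join(terms)
    return (
        "select id, port,\n"
        f"({total}) as cnt\n"
        "from (select t.*,\n"
        f"{lags}\n"
        f"from {table} t) t\n"
        "order by id"
    )

# Check against the question's sample data, with SQLite standing in for Hive.
conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer, port integer)")
conn.executemany("insert into t values (?, ?)",
                 [(1, 21), (2, 22), (3, 23), (4, 25), (5, 25), (6, 21)])
result = conn.execute(build_lag_query(5)).fetchall()
print(result)
```

The generated text grows quadratically with X, since each case expression lists every older column, so this is realistic only for modest window sizes.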
Answer 1 (score: 0)
One way to do this is with a self join, because window functions do not support distinct.
select t1.id,count(distinct t2.port) as cnt
from tbl t1
join tbl t2 on t1.id-t2.id>=0 and t1.id-t2.id<=5 --change this number per requirements
group by t1.id
order by t1.id
This assumes the ids are consecutive.
If they are not, get row numbers first and then apply the same logic. It would look like this:
with rownums as (select id,port,row_number() over(order by id) as rnum
from tbl)
select r1.id,count(distinct r2.port)
from rownums r1
join rownums r2 on r1.rnum-r2.rnum>=0 and r1.rnum-r2.rnum<=5
group by r1.id
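As a quick local sanity check of the row-number variant, the query can be run on the question's sample ports with deliberately non-consecutive ids. This is a sketch under assumptions: SQLite (via Python's sqlite3, version 3.25 or later for window functions) stands in for Hive, the helper `windowed_distinct` is hypothetical, and the window size is passed as a parameter instead of the hard-coded 5.

```python
import sqlite3

# SQLite stands in for Hive here; the table name tbl follows the answer above.
conn = sqlite3.connect(":memory:")
conn.execute("create table tbl (id integer, port integer)")
# Same ports as the question's sample, but with gaps in id to exercise row_number().
rows = [(10, 21), (20, 22), (30, 23), (40, 25), (50, 25), (60, 21)]
conn.executemany("insert into tbl values (?, ?)", rows)

def windowed_distinct(x):
    """Distinct ports over the current row and the x preceding rows, per id."""
    query = """
        with rownums as (select id, port, row_number() over (order by id) as rnum
                         from tbl)
        select r1.id, count(distinct r2.port)
        from rownums r1
        join rownums r2 on r1.rnum - r2.rnum >= 0 and r1.rnum - r2.rnum <= ?
        group by r1.id
        order by r1.id
    """
    return conn.execute(query, (x,)).fetchall()

print(windowed_distinct(5))   # [(10, 1), (20, 2), (30, 3), (40, 4), (50, 4), (60, 4)]
```

The counts match the expected COUNT column from the question even though the ids are no longer consecutive, which is the point of joining on rnum rather than id.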