Cumulative distinct count with Spark SQL

Asked: 2017-06-27 13:01:05

Tags: sql apache-spark apache-spark-sql

Using Spark 1.6.2.

Here is the data:

day | visitorID
-------------
1   | A
1   | B
2   | A
2   | C
3   | A
4   | A

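For each day, I want to count how many distinct visitors have been seen up to and including that day, i.e. that day's visitors accumulated with those of the previous days (I don't know the exact term for this, sorry).

This should give: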

day | visitors
--------------
 1  | 2 (A+B)
 2  | 3 (A+B+C)
 3  | 3 
 4  | 3
  • I tried a self-join, but it is far too slow.
  • I'm sure window functions are what I'm looking for, but I haven't managed to work it out :/

3 answers:

Answer 0: (score: 2)

You should be able to do this:

select day, max(visitors) as visitors
from (select day,
             count(distinct visitorId) over (order by day) as visitors
      from t
     ) d
group by day;

Actually, I think a better approach is to record each visitor only on the first day they appear:

select startday, sum(count(*)) over (order by startday) as visitors
from (select visitorId, min(day) as startday
      from t
      group by visitorId
     ) t
group by startday
order by startday;
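
To make the logic of the query above easier to follow, here is a minimal sketch in plain Python (not Spark) that mirrors its three steps on the sample data from the question. The function name `cumulative_new_visitors` and the `rows` list are made up for illustration. One thing the sketch makes visible: because the grouping is on the first day of each visitor, days on which no new visitor appears (3 and 4 in the sample) do not show up in the result.

```python
from itertools import accumulate

# Sample data from the question: (day, visitorID)
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")]

def cumulative_new_visitors(rows):
    # Inner query: the first day on which each visitor appears.
    first_day = {}
    for day, visitor in rows:
        first_day[visitor] = min(day, first_day.get(visitor, day))

    # GROUP BY startday: how many visitors were first seen on each day.
    new_per_day = {}
    for start in first_day.values():
        new_per_day[start] = new_per_day.get(start, 0) + 1

    # SUM(COUNT(*)) OVER (ORDER BY startday): running total in day order.
    days = sorted(new_per_day)
    totals = accumulate(new_per_day[d] for d in days)
    return list(zip(days, totals))

print(cumulative_new_visitors(rows))  # [(1, 2), (2, 3)]
```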

Answer 1: (score: 2)

In SQL, you could do this:

select t1.day,sum(max(t.cnt)) over(order by t1.day) as visitors
from tbl t1
left join (select minday,count(*) as cnt 
           from (select visitorID,min(day) as minday 
                 from tbl 
                 group by visitorID
                ) t 
           group by minday
          ) t 
on t1.day=t.minday
group by t1.day
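  • Use min to get the first day for each visitorID.
  • Count the rows for each minday found above.
  • Left join this to the original table and take the cumulative sum.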
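
The three steps above can be sketched in plain Python (not Spark) to check the logic against the sample data from the question. The function name `cumulative_distinct_by_day` and the `rows` list are made up for illustration. Iterating over every day of the original table plays the role of the left join, so days without any new visitor keep the previous total.

```python
# Sample data from the question: (day, visitorID)
rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (4, "A")]

def cumulative_distinct_by_day(rows):
    # Step 1: min(day) per visitorID, i.e. the first day of each visitor.
    minday = {}
    for day, visitor in rows:
        minday[visitor] = min(day, minday.get(visitor, day))

    # Step 2: count the rows for each minday.
    cnt = {}
    for first in minday.values():
        cnt[first] = cnt.get(first, 0) + 1

    # Step 3: "left join" on every day of the original table and keep a
    # running sum; a day with no match contributes 0 new visitors.
    result = []
    running = 0
    for day in sorted({day for day, _ in rows}):
        running += cnt.get(day, 0)
        result.append((day, running))
    return result

print(cumulative_distinct_by_day(rows))  # [(1, 2), (2, 3), (3, 3), (4, 3)]
```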

Another approach is:

select t1.day,sum(count(t.visitorid)) over(order by t1.day) as cnt 
from tbl t1
left join (select visitorID,min(day) as minday 
           from tbl 
           group by visitorID
          ) t 
on t1.day=t.minday and t.visitorid=t1.visitorid
group by t1.day

Answer 2: (score: 0)

Try this:
