Counting new unique values in a growing time window

Time: 2016-08-22 20:08:00

Tags: sql sql-server tsql

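I have a huge table of users (identified by a guid), some associated values, and the timestamp at which each row was inserted. A user may be associated with many rows in this table.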

guid | <other columns> | insertdate

For each month, I want to count how many unique new users were inserted. Doing it by hand for one month is easy:

select count(distinct guid)
from table
where insertdate >= '20060201' and insertdate < '20060301'
and guid not in (select guid from table where
                      insertdate >= '20060101' and insertdate < '20060201')

How can I do this in SQL for every successive month?

I thought of using a ranking function to tie each guid explicitly to a month:

select guid
,dense_rank() over ( order by datepart(YYYY, insertdate),
    datepart(m, insertdate)) as MonthRank
from table

and then iterating over each rank value:

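declare @no_times int
declare @counter int = 1
-- number of distinct year/month combinations in the table
select @no_times = count(distinct concat(datepart(year, insertdate),
     datepart(month, insertdate))) from table
while @no_times > 1
begin
    -- users present in month @counter + 1 but not in month @counter
    -- (assumes the ranked query above was saved as #ranked)
    select count(distinct guid), @counter + 1
    from #ranked
    where guid not in (select guid from #ranked where MonthRank = @counter)
    and MonthRank = @counter + 1
    set @counter += 1
    set @no_times -= 1
end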

I know this strategy is probably the wrong way to go about it.

Ideally, I would like the result set to look like this:

MonthRank | NoNewUsers

If a SQL wizard could point me in the right direction, I would be very interested and grateful.

3 answers:

Answer 0: (score: 0)

SELECT
    DATEPART(year,t.insertdate) AS YearNum
    ,DATEPART(mm,t.insertdate) as MonthNum
    ,COUNT(DISTINCT t.guid) AS NoNewUsers
    ,DENSE_RANK() OVER (ORDER BY COUNT(DISTINCT t.guid) DESC) AS MonthRank
FROM
    table t
    LEFT JOIN table t2
    ON t.guid = t2.guid
    AND t.insertdate > t2.insertdate
WHERE
    t2.guid IS NULL
GROUP BY
    DATEPART(year,t.insertdate)
    ,DATEPART(mm,t.insertdate)

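Use a left join to check whether the guid already exists with an earlier insert date; if it does not, count it with an aggregate as you normally would. If you also want a rank showing which month had the most new users, you can use your DENSE_RANK() function, but because you are already grouping by the columns you need, it does not require a partition clause.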
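The same "no earlier row for this guid" test can also be written with NOT EXISTS. The following is only a sketch under assumptions: #UserRows is a hypothetical temp table standing in for the real table, and the three sample rows are invented for illustration.

```sql
-- Hypothetical sample data; #UserRows stands in for the real table.
create table #UserRows (guid uniqueidentifier not null, insertdate datetime not null);

declare @a uniqueidentifier = newid(), @b uniqueidentifier = newid();
insert into #UserRows (guid, insertdate) values
    (@a, '20060105'),
    (@a, '20060210'),  -- repeat row: @a is not new in February
    (@b, '20060215');

-- Keep only rows that have no earlier row for the same guid, then count per month.
select datepart(year, t.insertdate) as YearNum
      ,datepart(month, t.insertdate) as MonthNum
      ,count(distinct t.guid) as NoNewUsers
from #UserRows t
where not exists (select 1
                  from #UserRows t2
                  where t2.guid = t.guid
                    and t2.insertdate < t.insertdate)
group by datepart(year, t.insertdate), datepart(month, t.insertdate)
order by YearNum, MonthNum;

drop table #UserRows;
```

With this sample data, January 2006 and February 2006 each report one new user.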

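Answer 1: (score: 0)

If you want the first time that a guid was entered, then your query doesn't quite work, because it only checks the previous month. You can get the first time using two aggregations: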

select year(first_insertdate), month(first_insertdate), count(*)
from (select t.guid, min(insertdate) as first_insertdate
      from t
      group by t.guid
     ) t
group by year(first_insertdate), month(first_insertdate)
order by year(first_insertdate), month(first_insertdate);

If you instead want to count a guid again each time it skips a month, then you can use lag():

select year(insertdate), month(insertdate), count(*)
from (select t.*,
             lag(insertdate) over (partition by guid order by insertdate) as prev_insertdate
      from t
     ) t
where prev_insertdate is null or
      datediff(month, prev_insertdate, insertdate) >= 2
group by year(insertdate), month(insertdate)
order by year(insertdate), month(insertdate);
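The `>= 2` threshold above relies on how DATEDIFF counts months: it counts month boundaries crossed, not elapsed days. A small illustration, using literal dates chosen only for the example:

```sql
-- DATEDIFF(month, ...) counts month boundaries crossed, not elapsed 30-day periods.
select datediff(month, '20060131', '20060201') as AdjacentMonths  -- 1: Jan -> Feb, no gap
      ,datediff(month, '20060131', '20060301') as SkippedAMonth;  -- 2: February was skipped
```

So a difference of 2 or more means at least one whole calendar month passed with no row for that guid.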

Answer 2: (score: 0)

I solved it with a dreadful while loop, and then a friend helped me solve it more efficiently in a different way.

The loop version:

--ranked by month
select t.TransactionID
,t.BuyerUserID as guid
,concat(datepart(year, t.InsertDate), datepart(month, 
t.InsertDate)) MonthRankName
,dense_rank() over ( order by datepart(YYYY, t.InsertDate), 
datepart(m, t.InsertDate)) as MonthRank
into #ranked
from table t;

--iterate
declare @counter int = 1
declare @no_times int 
select @no_times = count(distinct concat(datepart(year, t.InsertDate),
    datepart(month, t.InsertDate))) from table t;
select count(distinct r.guid) as NewUnique, r.Monthrank into #results
    from #ranked r
    where r.MonthRank = 1 group by r.MonthRank;
while @no_times > 1
begin
insert into #results
select count(distinct rt.guid) as NewUnique, @counter + 1 as MonthRank
from #ranked rt
where rt.guid not in
(
select rt2.guid from #ranked rt2 
where rt2.MonthRank = @counter
)
and rt.MonthRank = @counter + 1
set @counter = @counter+1
set @no_times = @no_times-1
end

select * from #results r

This ran very slowly (as you would expect).

The following approach turned out to be about 10 times faster:

select t.guid,
cast (concat(datepart(year, min(t.InsertDate)),
case when datepart(month, min(t.InsertDate)) < 10 then 
'0'+cast( datepart(month, min(t.InsertDate)) as varchar(10))
else cast (datepart(month, min(t.InsertDate)) as varchar(10)) end
) as int) as MonthRankName

into #NewUnique
from table t
group by t.guid;

select count(1) as NewUniques, t.MonthRankName from #NewUnique t
group by t.MonthRankName
order by t.MonthRankName

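Simply identify the first month in which each guid appears, then count how many guids fall in each month. The YearMonth value is formatted with a bit of string hackery (this seemed more efficient than format([date], 'yyyyMM'), but that needs more experimentation).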
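One possible alternative to the string concatenation is plain integer arithmetic, which yields the same yyyyMM integer. This is a sketch only, not benchmarked against either approach above; #UserRows and its rows are hypothetical sample data.

```sql
-- Hypothetical sample data; #UserRows stands in for the real table.
create table #UserRows (guid uniqueidentifier not null, InsertDate datetime not null);

declare @a uniqueidentifier = newid(), @b uniqueidentifier = newid();
insert into #UserRows (guid, InsertDate) values
    (@a, '20160105'),
    (@a, '20160822'),
    (@b, '20160822');

-- First month per guid as an integer (e.g. 201608), then count guids per first month.
select count(*) as NewUniques, f.MonthRankName
from (select guid
            ,year(min(InsertDate)) * 100 + month(min(InsertDate)) as MonthRankName
      from #UserRows
      group by guid) f
group by f.MonthRankName
order by f.MonthRankName;

drop table #UserRows;
```

With this sample data the result has one new guid for 201601 and one for 201608.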