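What is the equivalent of CUME_DIST() in SQL Server 2008?

Time: 2012-05-16 00:39:20

Tags: sql sql-server sql-server-2008 tsql analytics

SQL Server 2012 appears to have introduced CUME_DIST() and PERCENT_RANK(), which are used to compute the cumulative distribution of a column. Is there equivalent functionality in SQL Server 2008 to achieve this?

4 answers: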

Answer 0 (score: 3)

Never say never in SQL.

The statement:

select percent_rank() over (partition by <x> order by <y>)

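is basically equivalent to:

select 1.0 * row_number() over (partition by <x> order by <y>) / count(*) over (partition by <x>)

The "basically" means that it works when there are no duplicates in the data. Even when there are duplicates, it should be close enough. (The 1.0 factor is there because SQL Server would otherwise perform integer division.)

The "real" answer is that it is equivalent to:

select 1.0 * row_number() over (partition by <x> order by <y>) / count(distinct <y>) over (partition by <x>)

However, count(distinct) is not available as a window function, and expressing it in 2008 is painful unless you really need it.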

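The function cume_dist() is harder, because it needs a cumulative count, and in 2008 that requires a self-join. An approximation that assumes no duplicates:

with t as (select <x>, <y>,
                  row_number() over (partition by <x> order by <y>) as seqnum
           from <table>
          )
select c.*, c.cumcnt * 1.0 / count(*) over (partition by c.<x>) as cume_dist
from (select t.<x>, t.<y>, t.seqnum,
             count(tprev.seqnum) as cumcnt  -- rows at or before this one
      from t left outer join
           t tprev
           on t.<x> = tprev.<x> and t.seqnum >= tprev.seqnum
      group by t.<x>, t.<y>, t.seqnum
     ) c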
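
For an exact CUME_DIST() that also handles ties and avoids the self-join, one option is to combine RANK() with a per-value count, both of which SQL Server 2008 supports. This is a sketch rather than part of the answer above; the table variable @samples and its columns grp and val are made up for illustration.

```sql
-- Hypothetical sample data: @samples, grp and val are illustrative names.
DECLARE @samples TABLE (grp INT, val INT);
INSERT INTO @samples VALUES (1, 10), (1, 20), (1, 20), (1, 30);

-- CUME_DIST = (rows with val <= current val) / (rows in the partition).
-- RANK() - 1 counts the rows strictly below the current value;
-- COUNT(*) OVER (PARTITION BY grp, val) counts the current row and its ties.
SELECT grp, val,
       cd = 1.0 * (RANK() OVER (PARTITION BY grp ORDER BY val) - 1
                   + COUNT(*) OVER (PARTITION BY grp, val))
            / COUNT(*) OVER (PARTITION BY grp)
FROM @samples;
```

With this sample data the two tied rows (val = 20) should each get 0.75, since three of the four rows are at or below 20.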

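Answer 1 (score: 1)

No equivalent function exists before 2012, but one possible workaround involves a recursive CTE, at least for data sets of fewer than 32,767 rows. Here, a pair of dice is thrown 30 times: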

SET NOCOUNT ON;

DECLARE @t TABLE(i INT);
DECLARE @i INT=0;

WHILE @i<30 BEGIN
    INSERT INTO @t VALUES (CAST(RAND()*6 AS INT)+1 + CAST(RAND()*6 AS INT)+1);
    SET @i+=1;
END

DECLARE @tc INT; SELECT @tc=COUNT(*) FROM @t;

WITH a AS (
    SELECT *
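    -- d: fraction of all rows sharing this value of i
    -- (before 2012, an aggregate's OVER clause cannot take ORDER BY)
    , d=CAST(COUNT(1)OVER(PARTITION BY i) AS DECIMAL(5,2)) / @tc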
    , r=ROW_NUMBER()OVER(ORDER BY i)
    , pr=CAST((RANK()OVER(ORDER BY i)-1)AS DECIMAL(5,2)) / (@tc - 1)
    FROM @t
)
, rcte (i, d, r, cd, pr) AS (
    SELECT i, d, r, d, pr
    FROM a
    WHERE r=1

    UNION ALL

    SELECT a.i, a.d, a.r
    , CASE WHEN rcte.i<>a.i THEN CAST(rcte.cd+a.d AS DECIMAL(5,2)) ELSE rcte.cd END
    , a.pr
    FROM a
    INNER JOIN rcte ON rcte.r + 1 = a.r
)
SELECT i,cd,pr FROM rcte
OPTION (MAXRECURSION 32767)

Result:

i           cd                                      pr
----------- --------------------------------------- ---------------------------------------
2           0.0333333333333                         0.0000000000000
3           0.0700000000000                         0.0344827586206
4           0.2400000000000                         0.0689655172413
4           0.2400000000000                         0.0689655172413
4           0.2400000000000                         0.0689655172413
4           0.2400000000000                         0.0689655172413
4           0.2400000000000                         0.0689655172413
5           0.3100000000000                         0.2413793103448
5           0.3100000000000                         0.2413793103448
6           0.3800000000000                         0.3103448275862
6           0.3800000000000                         0.3103448275862
7           0.5100000000000                         0.3793103448275
7           0.5100000000000                         0.3793103448275
7           0.5100000000000                         0.3793103448275
7           0.5100000000000                         0.3793103448275
8           0.6100000000000                         0.5172413793103
8           0.6100000000000                         0.5172413793103
8           0.6100000000000                         0.5172413793103
9           0.8400000000000                         0.6206896551724
9           0.8400000000000                         0.6206896551724
9           0.8400000000000                         0.6206896551724
9           0.8400000000000                         0.6206896551724
9           0.8400000000000                         0.6206896551724
9           0.8400000000000                         0.6206896551724
9           0.8400000000000                         0.6206896551724
10          0.8700000000000                         0.8620689655172
11          0.9700000000000                         0.8965517241379
11          0.9700000000000                         0.8965517241379
11          0.9700000000000                         0.8965517241379
12          1.0000000000000                         1.0000000000000

Here is the SQL Server 2012 equivalent of the CTE above:

SELECT *
, cd=CUME_DIST()OVER(ORDER BY i)
, pr=PERCENT_RANK()OVER(ORDER BY i)
FROM @t;

Answer 2 (score: 0)

This gets pretty close. First, some sample data:

USE tempdb;
GO

CREATE TABLE dbo.DartScores
(
    TournamentID INT,
    PlayerID INT,
    Score INT
);

INSERT dbo.DartScores VALUES
(1, 1, 320),
(1, 2, 340),
(1, 3, 310),
(1, 4, 370),
(2, 1, 310),
(2, 2, 280),
(2, 3, 370),
(2, 4, 370);    

Now, the 2012 version of the query:

SELECT TournamentID, PlayerID, Score, 
  pr = PERCENT_RANK() OVER (PARTITION BY TournamentID ORDER BY Score),
  cd = CUME_DIST()    OVER (PARTITION BY TournamentID ORDER BY Score)
FROM dbo.DartScores
ORDER BY TournamentID, pr;

Produces this result:

TournamentID PlayerID Score pr                  cd
1            3        310   0                   0.25
1            1        320   0.333333333333333   0.5
1            2        340   0.666666666666667   0.75
1            4        370   1                   1
2            2        280   0                   0.25
2            1        310   0.333333333333333   0.5
2            3        370   0.666666666666667   1
2            4        370   0.666666666666667   1

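The 2005 equivalent gets pretty close, but it doesn't handle ties very well. Sorry, I'm out of gas tonight, otherwise I'd help figure out why. I know about as much as I learned from Itzik's new High Performance window function book.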

;WITH cte AS
(
    SELECT TournamentID, PlayerID, Score,
     rk = RANK()   OVER (PARTITION BY TournamentID ORDER BY Score),
     rn = COUNT(*) OVER (PARTITION BY TournamentID)
    FROM dbo.DartScores
)
SELECT TournamentID, PlayerID, Score,
  pr = 1e0*(rk-1)/(rn-1),
  cd = 1e0*(SELECT COALESCE(MIN(cte2.rk)-1, cte.rn)
    FROM cte AS cte2 WHERE cte2.rk > cte.rk) / rn
FROM cte;

Produces this result (note how the cume_dist values differ slightly for the ties):

TournamentID PlayerID Score pr                  cd
1            3        310   0                   0.25
1            1        320   0.333333333333333   0.5
1            2        340   0.666666666666667   0.75
1            4        370   1                   1
2            2        280   0                   0.25
2            1        310   0.333333333333333   0.5
2            3        370   0.666666666666667   0.75
2            4        370   0.666666666666667   0.75
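
One possible explanation for the tie discrepancy, offered as a guess rather than something the answer establishes: the correlated subquery compares ranks across all tournaments, so the tied rows at rank 3 in tournament 2 find the rank 4 row from tournament 1 and end up with 3/4. A sketch that adds a TournamentID predicate to the subquery (that predicate is the only change, and the query still assumes the dbo.DartScores table created above):

```sql
;WITH cte AS
(
    SELECT TournamentID, PlayerID, Score,
     rk = RANK()   OVER (PARTITION BY TournamentID ORDER BY Score),
     rn = COUNT(*) OVER (PARTITION BY TournamentID)
    FROM dbo.DartScores
)
SELECT TournamentID, PlayerID, Score,
  pr = 1e0*(rk-1)/(rn-1),
  -- Next higher rank within the same tournament, minus 1, is the number of
  -- rows at or below the current score; fall back to rn for the top rank.
  cd = 1e0*(SELECT COALESCE(MIN(cte2.rk)-1, cte.rn)
    FROM cte AS cte2
    WHERE cte2.TournamentID = cte.TournamentID   -- added predicate
      AND cte2.rk > cte.rk) / rn
FROM cte;
```

On the sample data, both tied 370 rows in tournament 2 should then come out with cd = 1, which matches the CUME_DIST() output shown earlier.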

Don't forget to clean up:

DROP TABLE dbo.DartScores;

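Answer 3 (score: 0)

Yes, there is a simple solution, at least for the percent_rank() part. You can use

1.0 * (rank() over (partition by <x> order by <y>) - 1) / (count(*) over (partition by <x>) - 1)

which gives you the same result as

percent_rank() over (partition by <x> order by <y>)

The 1.0 factor keeps SQL Server from performing integer division. rank() is one of the few analytic functions that already exist in SQL Server 2008.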
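
One caveat with the expression above: a partition containing a single row makes the denominator zero, which raises a divide-by-zero error under default settings, whereas PERCENT_RANK() returns 0 in that case. A hedged sketch of a guard using NULLIF and COALESCE follows; the table variable @scores and its columns grp and score are hypothetical.

```sql
-- Hypothetical sample data; grp 2 deliberately has only one row.
DECLARE @scores TABLE (grp INT, score INT);
INSERT INTO @scores VALUES (1, 310), (1, 320), (1, 370), (2, 400);

SELECT grp, score,
       -- NULLIF turns a zero denominator into NULL, so the division yields
       -- NULL instead of an error; COALESCE then maps that NULL to 0.
       pr = COALESCE(
              1.0 * (RANK() OVER (PARTITION BY grp ORDER BY score) - 1)
              / NULLIF(COUNT(*) OVER (PARTITION BY grp) - 1, 0),
              0)
FROM @scores;
```

The guard only matters for single-row partitions; for every other partition the expression is unchanged.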