Suppose you have a table with the columns Date, GroupID, X, and Y.
CREATE TABLE #sample
(
[Date] DATETIME,
GroupID INT,
X FLOAT,
Y FLOAT
)
DECLARE @date DATETIME = getdate()
INSERT INTO #sample VALUES(@date, 1, 1,3)
INSERT INTO #sample VALUES(DATEADD(d, 1, @date), 1, 1,1)
INSERT INTO #sample VALUES(DATEADD(d, 2, @date), 1, 4,2)
INSERT INTO #sample VALUES(DATEADD(d, 3, @date), 1, 3,3)
INSERT INTO #sample VALUES(DATEADD(d, 4, @date), 1, 6,4)
INSERT INTO #sample VALUES(DATEADD(d, 5, @date), 1, 7,5)
INSERT INTO #sample VALUES(DATEADD(d, 6, @date), 1, 1,6)
And you want to calculate the correlation of X and Y for each group. Currently I'm using CTEs, which get a bit messy:
;WITH DataAvgStd
AS (SELECT GroupID,
AVG(X) AS XAvg,
AVG(Y) AS YAvg,
STDEV(X) AS XStdev,
STDEV(Y) AS YSTDev,
COUNT(*) AS SampleSize
FROM #sample
GROUP BY GroupID),
ExpectedVal
AS (SELECT s.GroupID,
SUM(( X - XAvg ) * ( Y - YAvg )) AS ExpectedValue
FROM #sample s
JOIN DataAvgStd das
ON s.GroupID = das.GroupID
GROUP BY s.GroupID)
SELECT das.GroupID,
ev.ExpectedValue / ( das.SampleSize - 1 ) / ( das.XStdev * das.YSTDev )
AS
Correlation
FROM DataAvgStd das
JOIN ExpectedVal ev
ON das.GroupID = ev.GroupID
DROP TABLE #sample
It seems like there should be a way to do this in one pass using OVER and PARTITION, without any subqueries. Ideally, T-SQL would have a function so you could write:
SELECT GroupID, CORR(X, Y) OVER(PARTITION BY GroupID)
FROM #sample
GROUP BY GroupID
Answer 0 (score: 9)
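You can't avoid all the nested queries with this correlation formula, even if you use over(). The problem is that GROUP BY and OVER don't combine the way you'd need in a single query (window functions are evaluated after grouping, so they only see the grouped rows), and you can't nest aggregate functions, e.g. sum(x - avg(x)). So in the best case, depending on your data, you will need to keep at least the with.
Your code would look something like this: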
;WITH DataAvgStd
AS (SELECT GroupID,
STDEV(X) over(partition by GroupID) AS XStdev,
STDEV(Y) over(partition by GroupID) AS YSTDev,
COUNT(*) over(partition by GroupID) AS SampleSize,
( X - AVG(X) over(partition by GroupID)) * ( Y - AVG(Y) over(partition by GroupID)) AS ExpectedValue
FROM #sample s)
SELECT distinct GroupID,
SUM(ExpectedValue) over(partition by GroupID) / (SampleSize - 1 ) / ( XStdev * YSTDev ) AS Correlation
FROM DataAvgStd
Another approach is to use the equivalent single-pass formula for correlation described on Wikipedia.
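For reference, the single-pass form of the Pearson coefficient as it is commonly stated is shown here. The symbols are assumptions for illustration: n is the number of rows in a group, and the sums run over the rows of that group.

```latex
r_{xy} = \frac{n \sum x_i y_i - \sum x_i \sum y_i}
              {\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\;
               \sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}}
```

Each term maps directly onto one aggregate: n is COUNT(*), and the sums are SUM(X * Y), SUM(X), SUM(Y), SUM(X * X), and SUM(Y * Y).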
It can be written as:
SELECT GroupID,
Correlation=(COUNT(*) * SUM(X * Y) - SUM(X) * SUM(Y)) /
(SQRT(COUNT(*) * SUM(X * X) - SUM(X) * SUM(X))
* SQRT(COUNT(*) * SUM(Y* Y) - SUM(Y) * SUM(Y)))
FROM #sample s
GROUP BY GroupID;
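One caveat with the single-pass form: if X or Y is constant within a group (or the group has a single row), the denominator is zero and the query fails with a divide-by-zero error. Here is a minimal sketch of a guarded variant. It assumes the #sample table from the question still exists and that returning NULL for such groups is acceptable; that choice is an assumption, not part of the original answer.

```sql
-- Sketch: same single-pass formula, but a zero denominator yields NULL
-- instead of raising a divide-by-zero error.
-- Assumes #sample (GroupID, X, Y) from the question exists.
SELECT GroupID,
       (COUNT(*) * SUM(X * Y) - SUM(X) * SUM(Y)) /
       NULLIF(SQRT(COUNT(*) * SUM(X * X) - SUM(X) * SUM(X))
            * SQRT(COUNT(*) * SUM(Y * Y) - SUM(Y) * SUM(Y)), 0) AS Correlation
FROM #sample
GROUP BY GroupID;
```

NULLIF only covers the exact-zero case. With FLOAT columns, rounding could in principle push a term under the square root slightly below zero, which this sketch does not handle.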
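Answer 1 (score: 2)
There are two flavors of the Pearson correlation coefficient, one for a sample and one for an entire population. These are simple, single-pass, and, I believe, correct formulas for both: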
-- Methods for calculating the two Pearson correlation coefficients
SELECT
-- For Population
(avg(x * y) - avg(x) * avg(y)) /
(sqrt(avg(x * x) - avg(x) * avg(x)) * sqrt(avg(y * y) - avg(y) * avg(y)))
AS correlation_coefficient_population,
-- For Sample
(count(*) * sum(x * y) - sum(x) * sum(y)) /
(sqrt(count(*) * sum(x * x) - sum(x) * sum(x)) * sqrt(count(*) * sum(y * y) - sum(y) * sum(y)))
AS correlation_coefficient_sample
FROM (
-- The following generates a table of sample data containing two columns with a luke-warm and tweakable correlation
-- y = x for 0 thru 99, y = x - 100 for 100 thru 199, etc. Execute it as a stand-alone to see for yourself
-- x and y are CAST as DECIMAL to avoid integer math, you should definitely do the same
-- Try TOP 100 or less for full correlation (y = x for all cases), TOP 200 for a PCC of 0.5, TOP 300 for one near 0.33, etc.
-- The superfluous "+ 0" is where you could apply various offsets to see that they have no effect on the results
SELECT TOP 200
CAST(ROW_NUMBER() OVER (ORDER BY [object_id]) - 1 + 0 AS DECIMAL) AS x,
CAST((ROW_NUMBER() OVER (ORDER BY [object_id]) - 1) % 100 AS DECIMAL) AS y
FROM sys.all_objects
) AS a
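As noted in the code comments, you can try the example with TOP 100 or less for full correlation (y = x for all cases); TOP 200 yields a correlation very close to 0.5; TOP 300, about 0.33. There is a place (the "+ 0") where you can add an offset if you like; spoiler alert, it has no effect. Make sure you CAST your values as DECIMAL; integer math can significantly affect these calculations.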
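A note on the two expressions in the query, offered as a check worth running rather than a settled point: dividing the numerator and the denominator of the sum-based form by n² turns it into the average-based form, so the two columns should agree up to rounding. Here n stands for count(*) and an overline stands for avg(...); both are notation introduced for this note.

```latex
\frac{n \sum xy - \sum x \sum y}
     {\sqrt{n \sum x^2 - \left(\sum x\right)^2}\;\sqrt{n \sum y^2 - \left(\sum y\right)^2}}
=
\frac{\tfrac{1}{n^2}\left(n \sum xy - \sum x \sum y\right)}
     {\tfrac{1}{n}\sqrt{n \sum x^2 - \left(\sum x\right)^2}\;\tfrac{1}{n}\sqrt{n \sum y^2 - \left(\sum y\right)^2}}
=
\frac{\overline{xy} - \bar{x}\,\bar{y}}
     {\sqrt{\overline{x^2} - \bar{x}^2}\;\sqrt{\overline{y^2} - \bar{y}^2}}
```

This is consistent with the n versus n − 1 factors cancelling between the covariance and the two standard deviations in Pearson's r.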
Answer 2 (score: 1)
SQL is a bit funny about nesting aggregates or window functions, hence the need for a CTE or derived table.
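To illustrate the derived-table route, here is a minimal sketch that centers X and Y with window averages in an inner query and aggregates in the outer one. It assumes the #sample table from the question exists; the alias d and the column names dx and dy are made up for this example.

```sql
-- Sketch: Pearson correlation per group using a derived table instead of a CTE.
-- The inner query centers each value on its group mean (window AVG);
-- the outer query aggregates the centered values.
-- Assumes #sample (GroupID, X, Y) exists; d, dx, dy are illustrative names.
SELECT d.GroupID,
       SUM(d.dx * d.dy) /
       (SQRT(SUM(d.dx * d.dx)) * SQRT(SUM(d.dy * d.dy))) AS Correlation
FROM (SELECT GroupID,
             X - AVG(X) OVER (PARTITION BY GroupID) AS dx,
             Y - AVG(Y) OVER (PARTITION BY GroupID) AS dy
      FROM #sample) AS d
GROUP BY d.GroupID;
```

As with the other formulas, a group in which X or Y is constant makes the denominator zero, so a guard is needed if such groups can occur.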
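If it has to be implemented on the database server and you're looking for something more readable than a CTE, then your only option is to roll your own aggregate using the CLR.
There's a good tutorial on building a similar CLR aggregate here: http://www.sqlservercentral.com/articles/SQLCLR/71942/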