
Question

I have two tables:

ID, YRMO, Counts
1, Dec 2013, 4
1, Jan 2014, 6
1, Feb 2014, 7
2, Jan 2014, 6
2, Feb 2014, 8

ID, YRMO, Counts
1, Dec 2013, 10
1, Jan 2014, 8
1, Mar 2014, 12
2, Jan 2014, 6
2, Feb 2014, 10

I want to find the Pearson correlation coefficient for each ID group. There are more than 200 distinct IDs.

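The Pearson correlation is a measure of the linear correlation (dependence) between two variables X and Y, giving a value between +1 and -1 inclusive.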
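
For reference, the usual textbook definition for n paired observations can be written as below. The symbols are the standard ones and are not taken from the question; here x_i and y_i would be the counts from the two tables for the same ID and YRMO, and the barred symbols are their means.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,
               \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r_{xy} \le 1
```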

More information can be found here: http://oreilly.com/catalog/transqlcook/chapter/ch08.html, in the section on calculating correlation.

Answer 1

To calculate the Pearson correlation coefficient, you first need to calculate the mean, then the standard deviation, and then the correlation coefficient, as shown below.
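
Written out, the three quantities the steps compute correspond to the following population-style formulas. This is a sketch in standard notation, assuming n paired values per ID; the y quantities are defined in the same way as the x ones.

```latex
\mu_x = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma_x = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 - \mu_x^2}, \qquad
r_{xy} = \frac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \mu_x \mu_y}{\sigma_x \sigma_y}
```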

1. Calculate the mean

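-- per-ID mean: divide by the row count of each ID group (not of the whole table)
-- and cast to float so the division is not integer division
insert into tab2 (tab1_id, mean)
select ID, sum(cast([counts] as float)) / count(*) as mean
from tab1
group by ID;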

2. Calculate the standard deviation

-- per-ID population standard deviation: sqrt(mean of squares - square of mean)
update tab2
set stddev = (
select sqrt(
sum(cast([counts] as float) * [counts]) / count(*)
- tab2.mean * tab2.mean
) stddev
from tab1
where tab1.ID = tab2.tab1_id
group by tab1.ID);

3. Finally, the `Pearson Correlation Coefficient`

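-- NOTE: r1 and r2 are both tab1, and stats1 and stats2 are both tab2, so as written
-- this relates the series to itself. The join matches on ID only, so every row of an ID
-- is paired with every other row of that ID; pairing rows of the same period would need
-- an extra join condition, and correlating two tables would need r2 and stats2 to come
-- from the second table.
select sf.ID,
((sf.sum1 / (select count(*) from tab1 t where t.ID = sf.ID)
- stats1.mean * stats2.mean
)
/ (stats1.stddev * stats2.stddev)) as PCC
from (
select r1.ID,
 sum(cast(r1.[counts] as float) * r2.[counts]) as sum1
from tab1 r1
join tab1 r2
on r1.ID = r2.ID
group by r1.ID
) sf
join tab2 stats1
on stats1.tab1_id = sf.ID
join tab2 stats2
on stats2.tab1_id = sf.ID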

Results from the data you posted:

See the demo fiddle here: http://sqlfiddle.com/#!3/0da20/5
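
As a possible shortcut for steps 1 and 2, SQL Server's built-in aggregates `AVG` and `STDEVP` (population standard deviation) can produce both values per ID in one query. This is only a sketch, assuming the same `tab1` table and `[counts]` column used above; the output column names are illustrative.

```sql
-- AVG and STDEVP are built-in T-SQL aggregates.
-- The CAST avoids an integer average when [counts] is an int column.
SELECT  ID,
        AVG(CAST([counts] AS FLOAT)) AS mean,
        STDEVP([counts])             AS stddev
    FROM tab1
    GROUP BY ID;
```

The result could feed the `tab2` statistics table in place of the separate insert and update.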

Edit

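Refined a bit. You can use the function below to get the PCC, but I am not getting exactly the same result as yours; instead I get 0.999996000000000 for `ID = 1`.

create function calculate_PCC(@id int)
returns decimal(16,15)
as
begin
    declare @mean numeric(16,5);
    declare @stddev numeric(16,5);
    declare @count numeric(16,5);
    declare @pcc numeric(16,12);
    declare @store numeric(16,7);

    -- number of rows for this ID
    select @count = CONVERT(numeric(16,5), count(case when Id=@id then 1 end))
    from tab1;

    -- mean of the counts for this ID
    select @mean = convert(numeric(16,5),sum([Counts])) / @count
    from tab1
    WHERE ID = @id;

    -- mean of the squared counts for this ID
    select @store = (sum(counts * counts) / @count)
    from tab1
    WHERE ID = @id;

    set @stddev = sqrt(@store - (@mean * @mean));

    -- variance divided by the squared standard deviation: only tab1 is involved,
    -- so this is the series compared with itself and comes out at roughly 1
    set @pcc = ((@store - (@mean * @mean)) / (@stddev * @stddev));

    return @pcc;
end

Call the function, passing the ID as its argument.

This could be a good starting point for you. You can refine the calculation further from here.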

Answer 2

Single-pass solution:

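There are two flavors of the Pearson correlation coefficient, one for a sample and one for an entire population. These are simple, single-pass, and, I believe, correct formulas for both: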

-- Methods for calculating the two Pearson correlation coefficients
SELECT  
        -- For Population
        (avg(x * y) - avg(x) * avg(y)) / 
        (sqrt(avg(x * x) - avg(x) * avg(x)) * sqrt(avg(y * y) - avg(y) * avg(y))) 
        AS correlation_coefficient_population,
        -- For Sample
        (count(*) * sum(x * y) - sum(x) * sum(y)) / 
        (sqrt(count(*) * sum(x * x) - sum(x) * sum(x)) * sqrt(count(*) * sum(y * y) - sum(y) * sum(y))) 
        AS correlation_coefficient_sample
    FROM (
        -- The following generates a table of sample data containing two columns with a luke-warm and tweakable correlation 
        -- y = x for 0 thru 99, y = x - 100 for 100 thru 199, etc.  Execute it as a stand-alone to see for yourself
        -- x and y are CAST as DECIMAL to avoid integer math, you should definitely do the same
        -- Try TOP 100 or less for full correlation (y = x for all cases), TOP 200 for a PCC of 0.5, TOP 300 for one near 0.33, etc.
        -- The superfluous "+ 0" is where you could apply various offsets to see that they have no effect on the results
        SELECT TOP 200
                CAST(ROW_NUMBER() OVER (ORDER BY [object_id]) - 1 + 0 AS DECIMAL) AS x, 
                CAST((ROW_NUMBER() OVER (ORDER BY [object_id]) - 1) % 100 AS DECIMAL) AS y 
            FROM sys.all_objects
    ) AS a
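
One way to check the two expressions in the query against each other is to write the sums as S_x, S_y, S_xy, S_xx and S_yy (notation introduced here for illustration only) and substitute avg(x) = S_x / n, and so on. Multiplying the numerator and the denominator of the AVG-based form by n^2 gives the COUNT/SUM-based form, so in exact arithmetic both columns should return the same value:

```latex
\frac{\frac{S_{xy}}{n} - \frac{S_x}{n}\,\frac{S_y}{n}}
     {\sqrt{\frac{S_{xx}}{n} - \frac{S_x^2}{n^2}}\;
      \sqrt{\frac{S_{yy}}{n} - \frac{S_y^2}{n^2}}}
=
\frac{n\,S_{xy} - S_x S_y}
     {\sqrt{n\,S_{xx} - S_x^2}\;\sqrt{n\,S_{yy} - S_y^2}}
```

The same cancellation happens if n - 1 is used instead of n for the covariance and both standard deviations, so any difference between the two result columns would come from rounding rather than from the formulas.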

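As I noted in the comments, you can try the example with TOP 100 or less for full correlation (y = x for all cases); TOP 200 yields a correlation very close to 0.5; TOP 300, about 0.33; and so on. There is a place ("+ 0") where you can add an offset if you like; spoiler alert, it has no effect. Make sure you CAST your values as DECIMAL: integer math can significantly affect these calculations.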
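
Applied to the original question, the single-pass formula could be grouped by ID after pairing the two tables on ID and YRMO. The following is a minimal sketch under stated assumptions, not a tested solution: `table_a` and `table_b` are hypothetical names (the question does not name its tables), the columns are assumed to be `ID`, `YRMO` and `Counts`, and only months present in both tables contribute a pair.

```sql
-- Sketch only: table_a and table_b are hypothetical names for the question's two tables.
SELECT  p.ID,
        COUNT(*) AS n_pairs,  -- months present in both tables for this ID
        (COUNT(*) * SUM(p.x * p.y) - SUM(p.x) * SUM(p.y)) /
        NULLIF(SQRT(COUNT(*) * SUM(p.x * p.x) - SUM(p.x) * SUM(p.x))
             * SQRT(COUNT(*) * SUM(p.y * p.y) - SUM(p.y) * SUM(p.y)), 0)
        AS correlation_coefficient  -- NULL when either series is constant (including a single pair)
    FROM (
        -- Pair the counts of the two tables by ID and month
        SELECT  a.ID,
                CAST(a.Counts AS DECIMAL) AS x,
                CAST(b.Counts AS DECIMAL) AS y
            FROM table_a AS a
            INNER JOIN table_b AS b
                ON b.ID = a.ID
               AND b.YRMO = a.YRMO
    ) AS p
    GROUP BY p.ID;
```

The `NULLIF` guard is a design choice made here to avoid a divide-by-zero error for IDs whose counts do not vary. An inner join is used because a correlation needs both values of each pair; months that appear in only one table are ignored.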

Pearson Correlation SQL Server
