Please help me: I have been trying to compute a chi-squared test using SQL Server 2008 R2 Developer Edition. The query runs against the following sample dataset:
sessionnumber sessioncount timespent cnt
1 17 28 45
2 22 8 30
3 1 1 2
4 1 1 2
5 8 111 119
6 8 65 73
7 11 5 16
8 1 1 2
9 62 64 126
10 6 42 48
The query I have been trying is:
SELECT sessionnumber, sessioncount, timespent, expected, dev,
dev*dev/cast(expected as float) as chi_square
FROM (SELECT d3.sessionnumber, d3.sessioncount, d3.timespent,
(dim1.cnt * dim2.cnt * dim3.cnt)/cast((dimall.cnt*dimall.cnt)as float) as expected,
d3.cnt-(dim1.cnt * dim2.cnt * dim3.cnt)/(dimall.cnt*dimall.cnt) as dev FROM d3 JOIN
(SELECT sessionnumber, SUM(cast(cnt as float)) as cnt FROM d3
GROUP BY sessionnumber) dim1
ON d3.sessionnumber = dim1.sessionnumber JOIN
(SELECT sessioncount, SUM(cast(cnt as float)) as cnt FROM d3
GROUP BY sessioncount) dim2
ON d3.sessioncount = dim2.sessioncount JOIN
(SELECT timespent, SUM(cast(cnt as float)) as cnt FROM d3
GROUP BY timespent) dim3
ON d3.timespent = dim3.timespent CROSS JOIN
(SELECT SUM(cast(cnt as float)) as cnt FROM d3) dimall) a
This query produces wrong results. The output is:
sessionnumber sessioncount timespent expected dev chi_square
1 17 28 2.37921034130308E-09 44.9999999976208 851122729517.387
2 22 8 1.72099699796333E-10 29.9999999998279 5229526844351.02
3 1 1 1.3008335197251E-11 1.99999999998699 307495151323.689
4 1 1 1.3008335197251E-11 1.99999999998699 307495151323.689
5 8 111 1.90995107994937E-07 118.999999809005 74143260019.6156
6 8 65 5.09110109296227E-09 72.9999999949089 1046728379961.52
7 11 5 5.36406353430159E-11 15.9999999999464 4772501264409.71
8 1 1 1.3008335197251E-11 1.99999999998699 307495151323.689
9 62 64 6.56781317803123E-09 125.999999993432 2417242934291.85
10 6 42 1.41737398829092E-09 47.9999999985826 1625541331291.19
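The correct chi-square value for session number 1 and session number 2 should be 9.117, but my query gives a wrong result. (That value covers only the first two session-number rows, but it is the correct value for them.) I have been working on this for the last 3 days and finally concluded that the query itself is at fault.
Could someone please help? I would be very grateful. (I will also put a bounty on this question after 2 days.) Thanks in advance. I only know a little about SQL queries, since I have been using it for only about 3 months, so I really need some help!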
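Answer 0 (score: 3)
The chi-square value is defined on a two-dimensional contingency table, not on a three-dimensional one. You seem to be adapting the two-dimensional formula to three dimensions, and that simply does not work.
You can generalize chi-square to tests in higher dimensions. I discuss this in this blog post, along with my reasons for advising against that approach.
I suggest that you restate the problem as a two-dimensional chi-square test and apply the algorithm in your code to that problem. That is, analyze two dimensions at a time.
Edit:
I don't think you understand the chi-square test. It applies when you have two dimensions of categorical variables. For example, you might have "color" and "response" and a matrix like the following: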
Color Yes No
Red 18 203
Blue 10 182
Green 22 134
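You want to know the probability (likelihood) that the matrix was created at random, assuming that the distributions on the margins (the totals along each dimension) stay the same.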
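As a rough illustration of what the test computes for the color table above, the standard Pearson chi-square definition can be written as follows. The symbols O, E, R, C and N are notation introduced here, not taken from the question: O is an observed cell count, R and C are the row and column totals, and N is the grand total.

```latex
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\chi^2 = \sum_{i,j} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}}

% Worked cell (Red, Yes): R_{Red} = 18 + 203 = 221, C_{Yes} = 18 + 10 + 22 = 50, N = 569
E_{\text{Red},\text{Yes}} = \frac{221 \times 50}{569} \approx 19.42,
\qquad
\frac{(18 - 19.42)^2}{19.42} \approx 0.104
```

Note that in two dimensions the expected value divides by N once; the division by N squared only appears when three marginal totals are multiplied together.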
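Your example has two or three (if you include "sessionnumber") numeric variables. You should consider other statistical techniques. In fact, I would start with univariate correlation analysis (Pearson correlation) and linear regression.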
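As a minimal sketch of that starting point, the Pearson correlation between sessioncount and timespent could be computed directly in T-SQL along these lines. It assumes the table d3 from the question, treats every row equally (it does not weight by cnt), and the aliases n, sx, sy, sxy, sxx, syy and pearson_r are names made up for this example.

```sql
-- Pearson r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
SELECT (n * sxy - sx * sy)
       / NULLIF(SQRT((n * sxx - sx * sx) * (n * syy - sy * sy)), 0) AS pearson_r
FROM (SELECT CAST(COUNT(*) AS float)                     AS n,
             SUM(CAST(sessioncount AS float))                AS sx,
             SUM(CAST(timespent AS float))                   AS sy,
             SUM(CAST(sessioncount AS float) * timespent)    AS sxy,
             SUM(CAST(sessioncount AS float) * sessioncount) AS sxx,
             SUM(CAST(timespent AS float) * timespent)       AS syy
      FROM d3
     ) s;
```

The NULLIF guard returns NULL instead of raising a divide-by-zero error when one of the columns is constant.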
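Edit II:
I am providing the correct form of the chi-square query, even though I do not advocate using a chi-square test on your data. Presumably, the columns are correlated (instances with high session counts look similar even when they do not fall in the same bucket).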
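Your query has the right form; just remove one of the dimensions. With only two dimensions, the expected value is the product of the two marginal totals divided by the grand total once, not by its square:
SELECT sessioncount, timespent, expected, dev,
dev*dev/cast(expected as float) as chi_square
FROM (SELECT d3.sessionnumber, d3.sessioncount, d3.timespent,
-- expected = (sessioncount total * timespent total) / grand total
(dim2.cnt * dim3.cnt)/cast(dimall.cnt as float) as expected,
-- dev = observed - expected
d3.cnt-(dim2.cnt * dim3.cnt)/dimall.cnt as dev
FROM d3 JOIN
-- marginal totals per sessioncount
(SELECT sessioncount, SUM(cast(cnt as float)) as cnt
FROM d3
GROUP BY sessioncount
) dim2
ON d3.sessioncount = dim2.sessioncount JOIN
-- marginal totals per timespent
(SELECT timespent, SUM(cast(cnt as float)) as cnt
FROM d3
GROUP BY timespent
) dim3
ON d3.timespent = dim3.timespent CROSS JOIN
-- grand total
(SELECT SUM(cast(cnt as float)) as cnt
FROM d3
) dimall
) a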
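This works for the cells that are present in the table. To get the full chi-square value, however, you need to consider all cells, even those with a count of 0: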
SELECT sessioncount, timespent, cnt, expected, dev,
dev*dev/cast(expected as float) as chi_square
FROM (SELECT allcells.sessioncount, allcells.timespent,
cells.cnt,
(dim2.cnt * dim3.cnt)/cast(dimall.cnt as float) as expected,
coalesce(cells.cnt, 0) - (dim2.cnt * dim3.cnt)/dimall.cnt as dev
FROM (select sc.sessioncount, ts.timespent
from (select distinct sessioncount from d3) sc cross join
(select distinct timespent from d3) ts
) allcells left join
(select sessioncount, timespent, sum(cnt) as cnt
from d3
group by sessioncount, timespent
) cells
on allcells.sessioncount = cells.sessioncount and
allcells.timespent = cells.timespent left JOIN
(SELECT sessioncount, SUM(cast(cnt as float)) as cnt
FROM d3
GROUP BY sessioncount
) dim2
ON allcells.sessioncount = dim2.sessioncount left JOIN
(SELECT timespent, SUM(cast(cnt as float)) as cnt
FROM d3
GROUP BY timespent
) dim3
ON allcells.timespent = dim3.timespent CROSS JOIN
(SELECT SUM(cast(cnt as float)) as cnt
FROM d3
) dimall
) a
Here is a working SQL Fiddle.
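The query above returns one row per cell. As a hedged sketch of how the per-cell contributions might be rolled up into a single statistic, the same pieces can be written as common table expressions and summed. The names contrib, observed and degrees_of_freedom are invented for this example, and the table d3 is assumed to be the one from the question.

```sql
-- Sketch: total chi-square over every (sessioncount, timespent) cell, including empty ones.
WITH dim2 AS (
    SELECT sessioncount, SUM(CAST(cnt AS float)) AS cnt
    FROM d3
    GROUP BY sessioncount
), dim3 AS (
    SELECT timespent, SUM(CAST(cnt AS float)) AS cnt
    FROM d3
    GROUP BY timespent
), dimall AS (
    SELECT SUM(CAST(cnt AS float)) AS cnt
    FROM d3
), cells AS (
    SELECT sessioncount, timespent, SUM(CAST(cnt AS float)) AS cnt
    FROM d3
    GROUP BY sessioncount, timespent
), contrib AS (
    -- one row per possible cell; missing cells get an observed count of 0
    SELECT dim2.sessioncount, dim3.timespent,
           COALESCE(cells.cnt, 0) AS observed,
           dim2.cnt * dim3.cnt / dimall.cnt AS expected
    FROM dim2
    CROSS JOIN dim3
    CROSS JOIN dimall
    LEFT JOIN cells
           ON cells.sessioncount = dim2.sessioncount
          AND cells.timespent = dim3.timespent
)
SELECT SUM((observed - expected) * (observed - expected) / expected) AS chi_square,
       (COUNT(DISTINCT sessioncount) - 1) * (COUNT(DISTINCT timespent) - 1) AS degrees_of_freedom
FROM contrib;
```

The degrees of freedom are (rows - 1) * (columns - 1), which is what you would need to look up a p-value for the total.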
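Your original query may well work for a multidimensional chi-square. However, I had not looked closely at the data. Usually, when data has a cnt column, it is already in the form of a contingency table (possibly with the "0" cells missing). Your data has cells that are split across multiple rows (in particular "1, 1"), so the version above takes this into account.
And, because your original question was about a three-dimensional chi-square, here is the correct query: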
SELECT sessionnumber, sessioncount, timespent, cnt, expected, dev,
dev*dev/cast(expected as float) as chi_square
FROM (SELECT allcells.sessionnumber, allcells.sessioncount, allcells.timespent,
cells.cnt,
(dim1.cnt * dim2.cnt * dim3.cnt)/cast(dimall.cnt*dimall.cnt as float) as expected,
coalesce(cells.cnt, 0) - (dim1.cnt * dim2.cnt * dim3.cnt)/(dimall.cnt*dimall.cnt) as dev
FROM (select sn.sessionnumber, sc.sessioncount, ts.timespent
from (select distinct sessioncount from d3) sc cross join
(select distinct timespent from d3) ts cross join
(select distinct sessionnumber from d3) sn
) allcells left join
(select sessionnumber, sessioncount, timespent, sum(cnt) as cnt
from d3
group by sessionnumber, sessioncount, timespent
) cells
on allcells.sessioncount = cells.sessioncount and
allcells.timespent = cells.timespent and
allcells.sessionnumber = cells.sessionnumber left JOIN
(SELECT sessionnumber, SUM(cast(cnt as float)) as cnt
FROM d3
GROUP BY sessionnumber
) dim1
ON allcells.sessionnumber = dim1.sessionnumber left JOIN
(SELECT sessioncount, SUM(cast(cnt as float)) as cnt
FROM d3
GROUP BY sessioncount
) dim2
ON allcells.sessioncount = dim2.sessioncount left JOIN
(SELECT timespent, SUM(cast(cnt as float)) as cnt
FROM d3
GROUP BY timespent
) dim3
ON allcells.timespent = dim3.timespent CROSS JOIN
(SELECT SUM(cast(cnt as float)) as cnt
FROM d3
) dimall
) a
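And here is its corresponding SQL Fiddle.
For both SQL Fiddle versions, I have verified that the sum of the expected values equals the sum of the original counts, which is a good check on the arithmetic.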
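One way to run that check for the two-dimensional version is sketched below. It assumes the table d3 from the question; sum_expected and sum_observed are names chosen for this example. The two columns should agree up to floating-point rounding, because summing (row total * column total) / grand total over every cell gives back the grand total.

```sql
-- Sanity check: expected values over all cells should add up to the total count.
WITH dim2 AS (
    SELECT sessioncount, SUM(CAST(cnt AS float)) AS cnt
    FROM d3
    GROUP BY sessioncount
), dim3 AS (
    SELECT timespent, SUM(CAST(cnt AS float)) AS cnt
    FROM d3
    GROUP BY timespent
), dimall AS (
    SELECT SUM(CAST(cnt AS float)) AS cnt
    FROM d3
)
SELECT SUM(dim2.cnt * dim3.cnt / dimall.cnt) AS sum_expected,
       MAX(dimall.cnt)                       AS sum_observed
FROM dim2
CROSS JOIN dim3
CROSS JOIN dimall;
```

If the expected value were divided by the square of the grand total in the two-dimensional case, sum_expected would come out as 1 rather than the total count, which makes this check a quick way to catch that mistake.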