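SQL query for a chi-square test returns wrong results

Time: 2013-08-04 20:38:31

Tags: sql sql-server-2008 join chi-squared

Please help me: I have been trying to compute a chi-squared test in SQL Server 2008 R2 Developer Edition. The query runs without errors on the following sample data set: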

sessionnumber   sessioncount    timespent          cnt
    1                  17               28          45
    2                  22               8           30
    3                  1                1           2
    4                  1                1           2
    5                  8               111          119
    6                  8                65          73
    7                  11               5           16
    8                  1                1           2
    9                  62               64          126
   10                  6                42          48

The query I have been trying is:

SELECT sessionnumber, sessioncount, timespent, expected, dev,
dev*dev/cast(expected as float) as chi_square

FROM (SELECT d3.sessionnumber, d3.sessioncount, d3.timespent,
(dim1.cnt * dim2.cnt * dim3.cnt)/cast((dimall.cnt*dimall.cnt)as float) as expected,
d3.cnt-(dim1.cnt * dim2.cnt * dim3.cnt)/(dimall.cnt*dimall.cnt) as dev FROM d3 JOIN

(SELECT sessionnumber, SUM(cast(cnt as float)) as cnt FROM d3
GROUP BY sessionnumber) dim1
ON d3.sessionnumber = dim1.sessionnumber JOIN

(SELECT sessioncount, SUM(cast(cnt as float)) as cnt FROM d3
GROUP BY sessioncount) dim2
ON d3.sessioncount = dim2.sessioncount JOIN

(SELECT timespent, SUM(cast(cnt as float)) as cnt FROM d3
GROUP BY timespent) dim3
ON d3.timespent = dim3.timespent CROSS JOIN

(SELECT SUM(cast(cnt as float)) as cnt FROM d3) dimall) a

This query produces wrong results. The output is:

sessionnumber   sessioncount    timespent          expected                              dev            chi_square
    1                  17               28          2.37921034130308E-09        44.9999999976208    851122729517.387
    2                  22               8           1.72099699796333E-10        29.9999999998279    5229526844351.02
    3                  1                1           1.3008335197251E-11         1.99999999998699    307495151323.689
    4                  1                1           1.3008335197251E-11         1.99999999998699    307495151323.689
    5                  8               111          1.90995107994937E-07        118.999999809005    74143260019.6156
    6                  8                65          5.09110109296227E-09        72.9999999949089    1046728379961.52
    7                  11               5           5.36406353430159E-11        15.9999999999464    4772501264409.71
    8                  1                1           1.3008335197251E-11         1.99999999998699    307495151323.689
    9                  62               64          6.56781317803123E-09        125.999999993432    2417242934291.85
   10                  6                42          1.41737398829092E-09        47.9999999985826    1625541331291.19

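The correct chi-square value for session number 1 and session number 2 should be 9.117, but my query gives a different result. (That value is for a sample made up of only the first two session-number rows, but it is the correct value.) I have been working on this for the past 3 days and finally found that this query is what gives me the wrong results.

Could someone please help? I would be very grateful. (I will also put a bounty on this question in 2 days.) Thanks in advance. I only know a little about SQL queries because I am new to them and have been using SQL for only about 3 months, so I really need some help.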

1 Answer:

Answer 0: (score: 3)

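The chi-square value is defined on a two-dimensional contingency table, not on a three-dimensional one. You appear to be adapting the two-dimensional formula to three dimensions, and that simply does not work.

It is possible to generalize chi-square to higher-dimensional tests. I discuss this in this blog post, along with the reasons I argue against that approach.

I suggest that you restate the problem as a two-dimensional chi-square test and apply the algorithm in your code to that problem. That is, analyze two dimensions at a time.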
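For reference, the two-way statistic can be written out as follows. The symbols are notation introduced here, not taken from the question: O_{ij} is the observed count in cell (i, j), R_i and C_j are the row and column totals, and N is the grand total.

$$
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
\text{df} = (r - 1)(c - 1)
$$

Under complete independence of three factors with marginal totals A_i, B_j and C_k, the expected count picks up a second factor of N in the denominator, which is the `(dim1.cnt * dim2.cnt * dim3.cnt)/(dimall.cnt*dimall.cnt)` expression in the question's query:

$$
E_{ijk} = \frac{A_i \, B_j \, C_k}{N^2}
$$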

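Edit:

I don't think you understand the chi-square test. It applies when you have two dimensions of categorical variables. For example, you might have "color" and "response" and a matrix like the following: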

Color     Yes     No
Red        18    203
Blue       10    182
Green      22    134

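You then want to know the probability (likelihood) that the matrix was produced at random, assuming that the distributions along the margins (the totals along each dimension) stay the same.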
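As a rough sketch, the statistic for the Color matrix above could be computed like this in T-SQL. The table variable `@obs` and its column names are hypothetical, introduced only for this illustration; the counts are the ones from the matrix.

```sql
-- Hypothetical table variable holding the Color x Response counts shown above.
DECLARE @obs TABLE (color varchar(10), response varchar(3), cnt float);

INSERT INTO @obs (color, response, cnt) VALUES
    ('Red',   'Yes',  18), ('Red',   'No', 203),
    ('Blue',  'Yes',  10), ('Blue',  'No', 182),
    ('Green', 'Yes',  22), ('Green', 'No', 134);

-- Expected count per cell = row total * column total / grand total;
-- the statistic sums (observed - expected)^2 / expected over all cells.
SELECT SUM((o.cnt - e.expected) * (o.cnt - e.expected) / e.expected) AS chi_square
FROM @obs o
JOIN (SELECT r.color, c.response, r.cnt * c.cnt / t.cnt AS expected
      FROM (SELECT color, SUM(cnt) AS cnt FROM @obs GROUP BY color) r
      CROSS JOIN (SELECT response, SUM(cnt) AS cnt FROM @obs GROUP BY response) c
      CROSS JOIN (SELECT SUM(cnt) AS cnt FROM @obs) t
     ) e
  ON o.color = e.color AND o.response = e.response;
```

With 3 colors and 2 responses, the result would be compared against a chi-square distribution with (3 - 1) * (2 - 1) = 2 degrees of freedom.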

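Your example has two or three (if you include "sessionnumber") numeric variables. You should consider other statistical techniques. In fact, I would start with univariate correlation analysis (Pearson correlation) and linear regression.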
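A minimal sketch of the Pearson correlation between sessioncount and timespent, using the standard computational formula. It assumes each row of d3 counts as one unweighted observation, which is an assumption made here rather than something the question states.

```sql
-- Assumption: every row of d3 is a single, unweighted observation.
-- r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx^2) * (n*Syy - Sy^2))
SELECT (n * sxy - sx * sy)
       / SQRT((n * sxx - sx * sx) * (n * syy - sy * sy)) AS pearson_r
FROM (SELECT CAST(COUNT(*) AS float)                         AS n,
             SUM(CAST(sessioncount AS float))                AS sx,
             SUM(CAST(timespent AS float))                   AS sy,
             SUM(CAST(sessioncount AS float) * sessioncount) AS sxx,
             SUM(CAST(timespent AS float) * timespent)       AS syy,
             SUM(CAST(sessioncount AS float) * timespent)    AS sxy
      FROM d3
     ) s;
```

If either column is constant, the denominator is zero and SQL Server raises a divide-by-zero error, so a guard would be needed for that case.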

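Edit II:

I am providing the correct form of the chi-square query, even though I don't advocate using a chi-square test on your data. Presumably these columns are correlated (instances with high session counts appear similar, even if they do not fall in the same bucket).

Your query has the right form; you just need to remove one of the dimensions: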

SELECT sessioncount, timespent, expected, dev,
       dev*dev/cast(expected as float) as chi_square
FROM (SELECT d3.sessionnumber, d3.sessioncount, d3.timespent,
             -- two-way expected count: row total * column total / grand total
             (dim2.cnt * dim3.cnt)/cast(dimall.cnt as float) as expected,
             d3.cnt-(dim2.cnt * dim3.cnt)/dimall.cnt as dev
      FROM d3 JOIN
           (SELECT sessioncount, SUM(cast(cnt as float)) as cnt
            FROM d3
            GROUP BY sessioncount
           ) dim2
           ON d3.sessioncount = dim2.sessioncount JOIN
           (SELECT timespent, SUM(cast(cnt as float)) as cnt
            FROM d3
            GROUP BY timespent
           ) dim3
           ON d3.timespent = dim3.timespent CROSS JOIN
           (SELECT SUM(cast(cnt as float)) as cnt
            FROM d3
          ) dimall
     ) a

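This works for the cells that are present in the table. To get the full chi-square value, however, you need to consider all cells, even those with a count of 0: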

SELECT sessioncount, timespent, cnt, expected, dev,
       dev*dev/cast(expected as float) as chi_square
FROM (SELECT allcells.sessioncount, allcells.timespent,
             cells.cnt,
             (dim2.cnt * dim3.cnt)/cast(dimall.cnt as float) as expected,
             coalesce(cells.cnt, 0) - (dim2.cnt * dim3.cnt)/dimall.cnt as dev
      FROM (select sc.sessioncount, ts.timespent
            from (select distinct sessioncount from d3) sc cross join
                 (select distinct timespent from d3) ts
           ) allcells left join
           (select sessioncount, timespent, sum(cnt) as cnt
            from d3
            group by sessioncount, timespent
           ) cells
           on allcells.sessioncount = cells.sessioncount and
              allcells.timespent = cells.timespent left JOIN
           (SELECT sessioncount, SUM(cast(cnt as float)) as cnt
            FROM d3
            GROUP BY sessioncount
           ) dim2
           ON allcells.sessioncount = dim2.sessioncount left JOIN
           (SELECT timespent, SUM(cast(cnt as float)) as cnt
            FROM d3
            GROUP BY timespent
           ) dim3
           ON allcells.timespent = dim3.timespent CROSS JOIN
           (SELECT SUM(cast(cnt as float)) as cnt
            FROM d3
          ) dimall
     ) a

Here is a working SQL Fiddle.
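The per-cell terms still have to be summed to get a single test statistic. One way to sketch that, using common table expressions (available in SQL Server 2008) against the same d3 table, is shown below. The CTE names `scored`, `observed` and `degrees_of_freedom` are labels chosen for this illustration.

```sql
WITH cells AS (   -- one row per observed (sessioncount, timespent) cell
    SELECT sessioncount, timespent, SUM(CAST(cnt AS float)) AS cnt
    FROM d3
    GROUP BY sessioncount, timespent
),
dim2   AS (SELECT sessioncount, SUM(cnt) AS cnt FROM cells GROUP BY sessioncount),
dim3   AS (SELECT timespent,    SUM(cnt) AS cnt FROM cells GROUP BY timespent),
dimall AS (SELECT SUM(cnt) AS cnt FROM cells),
scored AS (       -- every cell of the table, including those with no rows
    SELECT dim2.cnt * dim3.cnt / dimall.cnt AS expected,
           COALESCE(cells.cnt, 0)           AS observed
    FROM dim2
    CROSS JOIN dim3
    CROSS JOIN dimall
    LEFT JOIN cells
           ON cells.sessioncount = dim2.sessioncount
          AND cells.timespent    = dim3.timespent
)
SELECT SUM((observed - expected) * (observed - expected) / expected) AS chi_square,
       (SELECT COUNT(*) - 1 FROM dim2) * (SELECT COUNT(*) - 1 FROM dim3) AS degrees_of_freedom
FROM scored;
```

This assumes all counts are positive, so that no expected value is zero.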

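Also, your original query might work for a multidimensional chi-square. However, I had not looked closely at the data. Usually, when data has a cnt column, it is in the form of a contingency table (possibly with the "0" cells missing). Your data has cells that are split across multiple rows (in particular "1, 1"), so the version above takes that into account.

And, because your original question was about a three-dimensional chi-square, here is the correct query: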

SELECT sessionnumber, sessioncount, timespent, cnt, expected, dev,
       dev*dev/cast(expected as float) as chi_square
FROM (SELECT allcells.sessionnumber, allcells.sessioncount, allcells.timespent,
             cells.cnt,
             (dim1.cnt * dim2.cnt * dim3.cnt)/cast(dimall.cnt*dimall.cnt as float) as expected,
             coalesce(cells.cnt, 0) - (dim1.cnt * dim2.cnt * dim3.cnt)/(dimall.cnt*dimall.cnt) as dev
      FROM (select sn.sessionnumber, sc.sessioncount, ts.timespent
            from (select distinct sessioncount from d3) sc cross join
                 (select distinct timespent from d3) ts cross join
                 (select distinct sessionnumber from d3) sn
           ) allcells left join
           (select sessionnumber, sessioncount, timespent, sum(cnt) as cnt
            from d3
            group by sessionnumber, sessioncount, timespent
           ) cells
           on allcells.sessioncount = cells.sessioncount and
              allcells.timespent = cells.timespent and
              allcells.sessionnumber = cells.sessionnumber left JOIN
           (SELECT sessionnumber, SUM(cast(cnt as float)) as cnt
            FROM d3
            GROUP BY sessionnumber
           ) dim1
           ON allcells.sessionnumber = dim1.sessionnumber left JOIN
            (SELECT sessioncount, SUM(cast(cnt as float)) as cnt
            FROM d3
            GROUP BY sessioncount
           ) dim2
           ON allcells.sessioncount = dim2.sessioncount left JOIN
           (SELECT timespent, SUM(cast(cnt as float)) as cnt
            FROM d3
            GROUP BY timespent
           ) dim3
           ON allcells.timespent = dim3.timespent CROSS JOIN
           (SELECT SUM(cast(cnt as float)) as cnt
            FROM d3
          ) dimall
     ) a

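And its corresponding SQL Fiddle.

For both SQL Fiddle versions, I verified that the sum of the expected values equals the sum of the original counts, which is a good check on the arithmetic.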
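A sketch of that check for the two-dimensional case, written against the same d3 table. The two output columns (named `sum_expected` and `sum_observed` here for illustration) should agree up to floating-point rounding.

```sql
-- Sum the expected count over every (sessioncount, timespent) cell
-- and compare it with the grand total of the observed counts.
SELECT SUM(dim2.cnt * dim3.cnt / dimall.cnt) AS sum_expected,
       MAX(dimall.cnt)                       AS sum_observed
FROM (SELECT sessioncount, SUM(CAST(cnt AS float)) AS cnt
      FROM d3
      GROUP BY sessioncount) dim2
CROSS JOIN
     (SELECT timespent, SUM(CAST(cnt AS float)) AS cnt
      FROM d3
      GROUP BY timespent) dim3
CROSS JOIN
     (SELECT SUM(CAST(cnt AS float)) AS cnt
      FROM d3) dimall;
```

The identity holds because summing row total * column total / grand total over all cells factors into (grand total * grand total) / grand total.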