Different chi2 values from sklearn and scipy.stats for the same contingency table

Time: 2018-06-14 15:38:32

Tags: python machine-learning scipy scikit-learn feature-selection

import numpy as np

X=np.array([7.20E+01,2.40E+01,0.00E+00,9.00E+00,0.00E+00,3.00E+00,0.00E+00,5.40E01,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,3.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,1.50E+01,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,1.11E+02,2.70E+01,0.00E+00,6.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
3.00E+00,
0.00E+00,
0.00E+00,
1.70E+01,
3.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
8.00E+00,
5.20E+01,
1.80E+01,
5.20E+01,
5.20E+01,
5.00E+01,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00,
0.00E+00])


y = np.array([0.0] * 73 + [1.0] * 73)  # 73 samples of class 0 followed by 73 of class 1

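These are X (for simplicity I have taken just one feature) and y for 146 samples. The first 73 samples belong to class 0 and the other 73 to class 1.

Now I want to compute the chi2 score for this feature. sklearn.feature_selection.chi2 gives me 579, whereas scipy.stats.chi2_contingency gives me 21.

Code used: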

import scipy.stats

obs = np.array([[0, 19], [73, 54]])
print(scipy.stats.chi2_contingency(obs, correction=False))

This gives 21, which I believe is the correct answer, because the formula is (a*d-b*c)**2*float(n)/((a+c)*(b+d)*(a+b)*(c+d))
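As a quick check of the arithmetic, the formula can be evaluated directly with the values from the table. This is only a sketch: chi2_2x2 is a hypothetical helper name, and a, b, c, d are read from obs row by row.

```python
def chi2_2x2(a, b, c, d):
    """Uncorrected chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    return (a * d - b * c) ** 2 * n / ((a + c) * (b + d) * (a + b) * (c + d))


# Values from obs = [[0, 19], [73, 54]]
stat = chi2_2x2(0, 19, 73, 54)
print(round(stat, 2))  # 21.84
```

The expression simplifies to 19 * 146 / 127, so the "21" reported above is about 21.84 before truncation.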

But sklearn gives 579 with this code:

import sklearn.feature_selection

X_d = X.reshape(-1, 1)  # a single feature column
y_d = y.reshape(-1, 1)
print(sklearn.feature_selection.chi2(X_d, y_d))

Why do the two approaches give different chi2 values?
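One possible explanation, offered as a sketch rather than a confirmed answer: sklearn.feature_selection.chi2 is documented as expecting non-negative features such as booleans or frequencies, so it appears to work with the per-class sums of the feature values rather than with counts of non-zero samples. The pure-Python function below compares each class's summed feature value with the sum expected from the class proportions. count_based_chi2 is a hypothetical name, and the six-sample input is toy data, not the data from the question.

```python
def count_based_chi2(values, labels):
    """Chi-square statistic from per-class sums of one non-negative feature."""
    n_samples = len(labels)
    total = sum(values)  # overall sum of the feature
    stat = 0.0
    for cls in sorted(set(labels)):
        # Observed: sum of the feature over the samples of this class.
        observed = sum(v for v, lab in zip(values, labels) if lab == cls)
        # Expected: overall sum scaled by the share of samples in this class.
        class_share = sum(1 for lab in labels if lab == cls) / n_samples
        expected = total * class_share
        stat += (observed - expected) ** 2 / expected
    return stat


# Toy data (hypothetical): all non-zero values fall in class 0.
toy_values = [3, 0, 5, 0, 0, 0]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_stat = count_based_chi2(toy_values, toy_labels)
print(toy_stat)  # 8.0
```

The non-zero values listed in X appear to add up to 579, all of them in class 0. Under this assumption the expected sum per class is 289.5 and the statistic is 289.5 + 289.5 = 579, which matches the sklearn output, while the scipy call works on counts of samples and therefore gives a different number.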

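Edit: how I created the contingency table for the scipy case

Reference formula: the expression given above, with A, B, C, D and N defined as follows.

The number of non-zero values in class 0 (the negative class) is 19, the total number of samples is 146, and the number of non-zero values in the positive class is 0. Based on this information, the values of A, B, C, D and N are 0, 19, 73, 54 and 146, which I pass to the scipy.stats function.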
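The counting described above can be sketched as follows. This is a minimal illustration, assuming the rows of the table are "feature non-zero" and "feature zero" and the columns are the positive and the negative class. contingency_2x2 is a hypothetical helper, and the six-sample input is toy data rather than the data from the question.

```python
def contingency_2x2(values, labels, positive=1):
    """Build [[a, b], [c, d]] from one feature and binary labels.

    a: non-zero feature, positive class   b: non-zero feature, negative class
    c: zero feature, positive class       d: zero feature, negative class
    """
    a = b = c = d = 0
    for value, label in zip(values, labels):
        nonzero = value != 0
        is_positive = label == positive
        if nonzero and is_positive:
            a += 1
        elif nonzero:
            b += 1
        elif is_positive:
            c += 1
        else:
            d += 1
    return [[a, b], [c, d]]


# Toy data (hypothetical): two non-zero values, both in the negative class.
toy_table = contingency_2x2([5, 0, 2, 0, 0, 0], [0, 0, 0, 1, 1, 1])
print(toy_table)  # [[0, 2], [3, 1]]
```

Applied to the X and y from the question, the same counting should yield [[0, 19], [73, 54]], which can be wrapped in np.array and passed to scipy.stats.chi2_contingency.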

0 answers:

No answers yet