I am computing a basic chi-square statistic with scikit-learn (sklearn.feature_selection.chi2(X, y)):
def chi_square(feat, target):
    """Return the chi-square statistics and p-values for each feature."""
    from sklearn.feature_selection import chi2
    ch, pval = chi2(feat, target)
    return ch, pval

chisq, p = chi_square(feat_mat, target_sc)
print(chisq)
print("**********************")
print(p)
I have 1500 samples, 45 features, and 4 classes. The input is a 1500x45 feature matrix and a target array with 1500 entries. The feature matrix is not sparse. When I run the program and print the 45-element array "chisq", I can see that component 13 has a negative value and p = 1. How is that possible? What does it mean, or what major mistake am I making?
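For reference, the negative entry can be located programmatically with plain NumPy; this is a small sketch that assumes the chisq and p arrays returned by the snippet above:

import numpy as np

# indices of features whose chi-square statistic came out negative
bad = np.where(chisq < 0)[0]
print(bad)           # [13] for the printout below
print(chisq[bad])    # the negative statistic(s)
print(p[bad])        # the corresponding p-value(s), here 1.0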
I attach the printouts of chisq and p:
[ 9.17099260e-01 3.77439701e+00 5.35004211e+01 2.17843312e+03
4.27047184e+04 2.23204883e+01 6.49985540e-01 2.02132664e-01
1.57324454e-03 2.16322638e-01 1.85592258e+00 5.70455805e+00
1.34911126e-02 -1.71834753e+01 1.05112366e+00 3.07383691e-01
5.55694752e-02 7.52801686e-01 9.74807972e-01 9.30619466e-02
4.52669897e-02 1.08348058e-01 9.88146259e-03 2.26292358e-01
5.08579194e-02 4.46232554e-02 1.22740419e-02 6.84545170e-02
6.71339545e-03 1.33252061e-02 1.69296016e-02 3.81318236e-02
4.74945604e-02 1.59313146e-01 9.73037448e-03 9.95771327e-03
6.93777954e-02 3.87738690e-02 1.53693158e-01 9.24603716e-04
1.22473138e-01 2.73347277e-01 1.69060817e-02 1.10868365e-02
8.62029628e+00]
**********************
[ 8.21299526e-01 2.86878266e-01 1.43400668e-11 0.00000000e+00
0.00000000e+00 5.59436980e-05 8.84899894e-01 9.77244281e-01
9.99983411e-01 9.74912223e-01 6.02841813e-01 1.26903019e-01
9.99584918e-01 1.00000000e+00 7.88884155e-01 9.58633878e-01
9.96573548e-01 8.60719653e-01 8.07347364e-01 9.92656816e-01
9.97473024e-01 9.90817144e-01 9.99739526e-01 9.73237195e-01
9.96995722e-01 9.97526259e-01 9.99639669e-01 9.95333185e-01
9.99853998e-01 9.99592531e-01 9.99417113e-01 9.98042114e-01
9.97286030e-01 9.83873717e-01 9.99745466e-01 9.99736512e-01
9.95239765e-01 9.97992843e-01 9.84693908e-01 9.99992525e-01
9.89010468e-01 9.64960636e-01 9.99418323e-01 9.99690553e-01
3.47893682e-02]
Answer 0 (score: 1)
If you add some print statements to the code defining chi2,
def chi2(X, y):
    X = atleast2d_or_csr(X)
    Y = LabelBinarizer().fit_transform(y)
    if Y.shape[1] == 1:
        Y = np.append(1 - Y, Y, axis=1)

    observed = safe_sparse_dot(Y.T, X)  # n_classes * n_features
    print(repr(observed))               # added print statement

    feature_count = array2d(X.sum(axis=0))
    class_prob = array2d(Y.mean(axis=0))
    expected = safe_sparse_dot(class_prob.T, feature_count)
    print(repr(expected))               # added print statement

    return stats.chisquare(observed, expected)
you will see that expected ends up with some negative values. For example,
import numpy as np
import sklearn.feature_selection as FS

x = np.array([-0.23918515, -0.29967287, -0.33007592, 0.07383528, -0.09205183,
              -0.12548226, 0.04770942, -0.54318463, -0.16833203, -0.00332341,
              0.0179646, -0.0526383, 0.04288736, -0.27427317, -0.16136621,
              -0.09228812, -0.2255725, -0.03744027, 0.02953499, -0.17387492])
y = np.array([1, 2, 2, 1, 1, 1, 1, 3, 1, 1, 3, 2, 2, 1, 1, 2, 1, 2, 1, 1],
             dtype='int64')

FS.chi2(x.reshape(-1, 1), y)
produces
observed:
array([[-1.31238179],
[-0.76922812],
[-0.52522003]])
expected:
array([[-1.56409796],
[-0.78204898],
[-0.26068299]])
Then stats.chisquare(observed, expected) is called. There, observed and expected are assumed to be frequencies of categories. They should both be non-negative, because frequencies are non-negative.
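To see how this turns into a negative statistic and p = 1, here is a minimal sketch that applies the chi-square formula sum((observed - expected)**2 / expected) to the values printed above; the degrees of freedom follow scipy.stats.chisquare's default of number of categories minus one:

import numpy as np
from scipy import stats

observed = np.array([-1.31238179, -0.76922812, -0.52522003])
expected = np.array([-1.56409796, -0.78204898, -0.26068299])

# Each term (observed - expected)**2 / expected has a non-negative numerator
# but a *negative* denominator, so the sum can come out negative.
stat = ((observed - expected) ** 2 / expected).sum()

# The p-value is the chi-square survival function evaluated at the statistic;
# for a negative statistic it is exactly 1.0, matching the p = 1 seen above.
pval = stats.chi2.sf(stat, df=len(observed) - 1)

print(stat, pval)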
I am not familiar enough with scikit-learn to suggest how to fix your problem, but the data you are passing to chi2 appears to be of the wrong kind, since expected is supposed to be non-negative. (For example, perhaps the x values above should all be positive and represent observed frequencies?)
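Not part of the original answer, but to illustrate the point: if the features are first shifted into a non-negative range (here with sklearn.preprocessing.MinMaxScaler, purely to demonstrate the mechanics rather than as a statistically endorsed fix), both observed and expected become non-negative and the returned statistic no longer goes negative:

import numpy as np
import sklearn.feature_selection as FS
from sklearn.preprocessing import MinMaxScaler

x = np.array([-0.23918515, -0.29967287, -0.33007592, 0.07383528, -0.09205183,
              -0.12548226, 0.04770942, -0.54318463, -0.16833203, -0.00332341,
              0.0179646, -0.0526383, 0.04288736, -0.27427317, -0.16136621,
              -0.09228812, -0.2255725, -0.03744027, 0.02953499, -0.17387492])
y = np.array([1, 2, 2, 1, 1, 1, 1, 3, 1, 1, 3, 2, 2, 1, 1, 2, 1, 2, 1, 1])

# Rescale the single feature into [0, 1] so chi2 only sees non-negative values.
x_nonneg = MinMaxScaler().fit_transform(x.reshape(-1, 1))

chisq, pval = FS.chi2(x_nonneg, y)
print(chisq, pval)   # the statistic is now >= 0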