Question

I am performing mean imputation of missing values in a large NumPy array.

>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
...
>>> X_train_reshaped.shape
(6794600, 19)

>>> imp = Imputer()
>>> X_train_reshaped_imputed = imp.fit_transform(X_train_reshaped)
>>> imp.statistics_
array([2.07519836e+04, 8.74635740e-02, 1.91142597e+02, 5.75686713e+01,
       1.97315320e+02, 1.97201888e+02, 2.01269915e+02, 3.36951689e+03,
       1.57597208e+02, 1.91400415e+02, 1.70882735e+02, 4.74499857e+02,
       4.74940899e+02, 4.72919241e+02, 6.79176139e+01, 9.35082042e+00,
       4.07991846e-02, 3.84304215e+01, 7.68858280e+02])

So far, so good.

But the column means of the resulting array do not match imp.statistics_:

>>> np.mean(X_train_reshaped_imputed, axis=0)
array([2.0692746e+04, 8.7463573e-02, 1.9071404e+02, 5.9913071e+01,
       2.0061151e+02, 1.9948010e+02, 2.0548715e+02, 3.4639802e+03,
       1.5306616e+02, 1.9219826e+02, 1.7292293e+02, 4.9702396e+02,
       4.9672128e+02, 4.9482492e+02, 6.7616440e+01, 9.2078524e+00,
       4.0943827e-02, 3.7669365e+01, 7.6714471e+02], dtype=float32)

Since mean imputation should not change the mean, why is there a difference here?
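
To spell out the expectation behind that question, here is a short derivation. The symbols are notation introduced only for this sketch: a column has n observed values x_1, ..., x_n with mean \mu, and m missing entries that are each filled with \mu.

```latex
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad\Longrightarrow\qquad
\bar{x}_{\text{imputed}}
  = \frac{\sum_{i=1}^{n} x_i + m\,\mu}{n + m}
  = \frac{n\,\mu + m\,\mu}{n + m}
  = \mu
```

In exact arithmetic the imputed column mean therefore equals the fitted statistic. For example, a column with observed values 2, 3, 7, 6 (mean 4.5) and two missing entries gives (18 + 2 * 4.5) / 6 = 4.5.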

However, my experiment on a smaller array gives the expected result:

X = np.array([[2, 2], [3, 3], [np.NaN, 4], [7, np.NaN], [6, 8], [np.NaN, np.NaN]])
print('X')
print(X)
print()

imp = Imputer()
X_transform = imp.fit_transform(X)
print('X_transform')
print(X_transform)
print()

print('Imputer statistics')
print(imp.statistics_)
print()

print('Mean of result')
print(np.mean(X_transform, axis=0))

The output I get is:

X
[[ 2.  2.]
 [ 3.  3.]
 [nan  4.]
 [ 7. nan]
 [ 6.  8.]
 [nan nan]]

X_transform
[[2.   2.  ]
 [3.   3.  ]
 [4.5  4.  ]
 [7.   4.25]
 [6.   8.  ]
 [4.5  4.25]]

Imputer statistics
[4.5  4.25]

Mean of result
[4.5  4.25]
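
One difference between the two experiments that may be worth checking: the large result above is reported with dtype=float32, while the small array is built from Python floats and so defaults to float64. For floating-point input, np.mean accumulates in the input's own precision unless a dtype is passed, and the NumPy documentation notes that this can be inaccurate for float32. The sketch below is only a diagnostic under that assumption, not a confirmed explanation. The helper column_means and the random array demo are hypothetical stand-ins, not part of the original data.

```python
import numpy as np


def column_means(a):
    """Return column means accumulated in the input dtype and in float64."""
    native = np.mean(a, axis=0)                   # accumulator uses a.dtype
    wide = np.mean(a, axis=0, dtype=np.float64)   # accumulator forced to float64
    return native, wide


# Hypothetical stand-in for a large float32 array such as the imputed data.
rng = np.random.default_rng(0)
demo = rng.uniform(0.0, 1000.0, size=(1_000_000, 2)).astype(np.float32)

native, wide = column_means(demo)
print(native.dtype, wide.dtype)
print(np.abs(native - wide))  # per-column gap between the two accumulations
```

If calling column_means(X_train_reshaped_imputed) shows the float64 result agreeing with imp.statistics_, the mismatch would come from the mean computation rather than from the imputation itself.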

Scikit-learn mean imputation gives a different mean after imputation

0 answers: