I am doing mean imputation of missing values in a large numpy array.
>>> import numpy as np
>>> from sklearn.preprocessing import Imputer
...
>>> X_train_reshaped.shape
(6794600, 19)
>>> imp = Imputer()
>>> X_train_reshaped_imputed = imp.fit_transform(X_train_reshaped)
>>> imp.statistics_
array([2.07519836e+04, 8.74635740e-02, 1.91142597e+02, 5.75686713e+01,
1.97315320e+02, 1.97201888e+02, 2.01269915e+02, 3.36951689e+03,
1.57597208e+02, 1.91400415e+02, 1.70882735e+02, 4.74499857e+02,
4.74940899e+02, 4.72919241e+02, 6.79176139e+01, 9.35082042e+00,
4.07991846e-02, 3.84304215e+01, 7.68858280e+02])
So far so good.
However, the column means of the resulting array do not match imp.statistics_:
>>> np.mean(X_train_reshaped_imputed, axis=0)
array([2.0692746e+04, 8.7463573e-02, 1.9071404e+02, 5.9913071e+01,
2.0061151e+02, 1.9948010e+02, 2.0548715e+02, 3.4639802e+03,
1.5306616e+02, 1.9219826e+02, 1.7292293e+02, 4.9702396e+02,
4.9672128e+02, 4.9482492e+02, 6.7616440e+01, 9.2078524e+00,
4.0943827e-02, 3.7669365e+01, 7.6714471e+02], dtype=float32)
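Since mean imputation should not change the column means, why is there a difference here?
My experiment on a smaller array, on the other hand, gives the expected result: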
X = np.array([[2, 2], [3, 3], [np.nan, 4], [7, np.nan], [6, 8], [np.nan, np.nan]])
print('X')
print(X)
print()
imp = Imputer()
X_transform = imp.fit_transform(X)
print('X_transform')
print(X_transform)
print()
print('Imputer statistics')
print(imp.statistics_)
print()
print('Mean of result')
print(np.mean(X_transform, axis=0))
The output I get is:
X
[[ 2. 2.]
[ 3. 3.]
[nan 4.]
[ 7. nan]
[ 6. 8.]
[nan nan]]
X_transform
[[2. 2. ]
[3. 3. ]
[4.5 4. ]
[7. 4.25]
[6. 8. ]
[4.5 4.25]]
Imputer statistics
[4.5 4.25]
Mean of result
[4.5 4.25]
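One check I could run on the large array, assuming the mismatch is a float32 accumulation effect rather than an imputation bug (the dtype=float32 in the np.mean output hints at this, but it is only an assumption): recompute the column means with a float64 accumulator and compare. Below is a minimal sketch on synthetic data; X_demo and column_means_float64 are hypothetical names used for illustration and are not part of my actual code.

```python
import numpy as np


def column_means_float64(a):
    """Return column means accumulated in float64, whatever dtype `a` has."""
    return np.mean(a, axis=0, dtype=np.float64)


# Synthetic stand-in for the real data (hypothetical, not X_train_reshaped):
# many rows, few columns, stored as float32.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1000, size=(1_000_000, 3)).astype(np.float32)

# Default: for float32 input, np.mean accumulates in float32.
means_f32 = np.mean(X_demo, axis=0)

# Same data, but the sum is accumulated in float64.
means_f64 = column_means_float64(X_demo)

print('float32 accumulator:', means_f32)
print('float64 accumulator:', means_f64)
print('difference:', means_f64 - means_f32.astype(np.float64))
```

If the float64 means computed this way on X_train_reshaped_imputed agree with imp.statistics_, the discrepancy would come from how np.mean sums float32 values over millions of rows, not from the imputation itself.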