Question

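The dataset I am working with has many missing values, and I think K-nearest-neighbors imputation could fill them in. The simplest way to do that seems to be IterativeImputer from sklearn.impute with a KNeighborsRegressor as the estimator. I used the following code: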

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # must be imported before IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor

# input_file, class_col and cols are defined earlier (not shown)
opened_file = pd.read_csv(input_file, sep=",", header=0, na_values="NaN", dtype=str)
# Intended to drop rows with a missing class label.
# Note: `== np.nan` is always False, so this line drops no rows.
opened_file.drop(opened_file.loc[opened_file[class_col] == np.nan].index, inplace=True)
input_estimator = IterativeImputer(random_state=42, estimator=KNeighborsRegressor(n_neighbors=1))
usable_data = opened_file[cols]
usable_data = input_estimator.fit_transform(usable_data)

However, this raised the error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

The same happens with SimpleImputer. However, when I run the (deprecated) Imputer from sklearn.preprocessing, the code works fine:

from sklearn.preprocessing import Imputer  # deprecated in scikit-learn 0.20, removed in 0.22

opened_file = pd.read_csv(input_file, sep=",", header=0, na_values="NaN", dtype=str)
opened_file.drop(opened_file.loc[opened_file[class_col] == np.nan].index, inplace=True)
usable_data = opened_file[cols]
usable_data = Imputer().fit_transform(usable_data)

This produces the output:

[[  0.          26.           4.         ...  48.923       72.615
  100.        ]
 [  0.          26.           4.         ...  48.923       72.615
  100.        ]
 [  0.          26.           4.         ...  48.923       72.615
  100.        ]
 ...
 [  1.          10.           3.         ...  49.63712147  73.50532432
   99.12231621]
 [  1.          10.           3.         ...  49.63712147  73.50532432
   99.12231621]
 [  0.979414    23.16310899   3.95972961 ...  49.63712147  73.50532432
   99.12231621]]
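
The traceback alone does not say which entries trigger the ValueError. One way to narrow it down is to count, per column, the missing, unparseable, and infinite values. This is a minimal sketch, assuming the columns are meant to be numeric; `summarize_non_finite` and the `toy` frame are hypothetical stand-ins for my real data, not part of scikit-learn or pandas.

```python
import numpy as np
import pandas as pd


def summarize_non_finite(frame: pd.DataFrame) -> pd.DataFrame:
    """Count missing, unparseable, and infinite entries per column."""
    # Entries that cannot be parsed as numbers become NaN here.
    numeric = frame.apply(pd.to_numeric, errors="coerce")
    return pd.DataFrame({
        "missing": frame.isna().sum(),
        # NaN after conversion but present before it: not a valid number.
        "unparseable": (numeric.isna() & frame.notna()).sum(),
        "infinite": np.isinf(numeric).sum(),
    })


# Hypothetical toy data standing in for opened_file[cols]
toy = pd.DataFrame(
    {"a": ["1", np.inf, np.nan], "b": ["2.5", "x", "3"]},
    dtype=object,
)
report = summarize_non_finite(toy)
print(report)
```

If the "infinite" or "unparseable" counts are non-zero for the real data, that may explain why validation rejects the input, although I have not confirmed that this is the cause here.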

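All operations are performed on pandas DataFrames. I could keep using Imputer, but I would like to use K-nearest neighbors to fill in the missing values.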
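
For reference, here is a minimal sketch of the kind of K-nearest-neighbors imputation I am after. It assumes scikit-learn 0.22 or later, where `sklearn.impute.KNNImputer` is available, and it assumes that converting the string columns to floats and treating infinities as missing is acceptable for my data. The `toy` frame is a hypothetical stand-in for `opened_file[cols]`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy frame standing in for opened_file[cols] (read with dtype=str)
toy = pd.DataFrame(
    {"x": ["1.0", "2.0", np.nan, "4.0"], "y": ["10", np.nan, "30", "40"]},
    dtype=object,
)

# Convert strings to floats; entries that cannot be parsed become NaN.
numeric = toy.apply(pd.to_numeric, errors="coerce")
# Assumption: infinities should be treated as missing values.
numeric = numeric.replace([np.inf, -np.inf], np.nan)

# Each missing entry is filled from the single nearest row that has that feature.
imputer = KNNImputer(n_neighbors=1)
filled = imputer.fit_transform(numeric)
print(filled)
```

The same cleaned `numeric` frame could presumably also be passed to IterativeImputer with a KNeighborsRegressor estimator, but I have not verified that this avoids the error above.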

Scikit-learn: Imputer works, SimpleImputer and IterativeImputer do not

0 answers: