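The dataset I'm working with has a lot of missing values, and I think a K-nearest-neighbors approach could fill them in. The simplest way to do that seems to be IterativeImputer from sklearn.impute. I used the following code: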
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # enables IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor

# Read every column as a string; "NaN" entries are treated as missing.
opened_file = pd.read_csv(input_file, sep=",", header=0, na_values="NaN", dtype=str)
# Intended to drop the rows whose class label is missing.
opened_file.drop(opened_file.loc[opened_file[class_col] == np.nan].index, inplace=True)
input_estimator = IterativeImputer(random_state=42, estimator=KNeighborsRegressor(n_neighbors=1))
usable_data = opened_file[cols]
usable_data = input_estimator.fit_transform(usable_data)
However, this raised the following error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
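The message names three possible triggers (NaN, infinity, or a value too large for float64), and because the file is read with dtype=str the columns hold strings rather than floats. The following is a minimal sketch for checking which of these shows up in each column. It runs on a small made-up frame, not on the real file: the column names and values are hypothetical, and summarize_problem_values is an illustrative helper, not part of pandas or scikit-learn.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real file: values are strings, as with dtype=str.
frame = pd.DataFrame(
    {
        "a": ["1.0", "2.0", np.nan, "4.0"],
        "b": ["0.5", "inf", "1.5", "n/a"],
    }
)


def summarize_problem_values(df):
    """Count missing, non-numeric, and infinite entries per column."""
    was_missing = df.isna()
    # Entries that cannot be parsed as numbers become NaN.
    numeric = df.apply(pd.to_numeric, errors="coerce")
    return pd.DataFrame(
        {
            "missing": was_missing.sum(),
            "non_numeric": (numeric.isna() & ~was_missing).sum(),
            "infinite": np.isinf(numeric).sum(),
        }
    )


report = summarize_problem_values(frame)
print(report)
```

A nonzero count in the non_numeric or infinite column would point to values the imputer cannot treat as ordinary missing entries. This only narrows the search and may not be the cause in this particular dataset.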
SimpleImputer raises the same ValueError. However, when I run the (deprecated) Imputer from sklearn.preprocessing, the code works fine:
import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer  # deprecated

# Same loading and row-dropping steps as before.
opened_file = pd.read_csv(input_file, sep=",", header=0, na_values="NaN", dtype=str)
opened_file.drop(opened_file.loc[opened_file[class_col] == np.nan].index, inplace=True)
usable_data = opened_file[cols]
usable_data = Imputer().fit_transform(usable_data)
This produces the output:
[[ 0. 26. 4. ... 48.923 72.615 100. ]
 [ 0. 26. 4. ... 48.923 72.615 100. ]
 [ 0. 26. 4. ... 48.923 72.615 100. ]
 ...
 [ 1. 10. 3. ... 49.63712147 73.50532432 99.12231621]
 [ 1. 10. 3. ... 49.63712147 73.50532432 99.12231621]
 [ 0.979414 23.16310899 3.95972961 ... 49.63712147 73.50532432 99.12231621]]
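All operations are performed on a pandas DataFrame. I could fall back on Imputer, but I want to use K-nearest neighbors to fill in the missing values.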
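For reference, a minimal sketch of one way K-nearest-neighbors imputation can be set up. It swaps in KNNImputer from sklearn.impute (available in scikit-learn 0.22 and later) in place of IterativeImputer, and it assumes the string columns are first converted to numbers. The frame and its column names are hypothetical: "label" stands in for class_col, and "f1" and "f2" stand in for cols. The sketch also drops unlabelled rows with dropna, because NaN never compares equal to anything, itself included, so a `== np.nan` filter selects no rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: "label" plays the role of class_col, f1/f2 the role of cols.
raw = pd.DataFrame(
    {
        "label": ["0", "1", np.nan, "0", "1"],
        "f1": ["1.0", "2.0", "3.0", np.nan, "5.0"],
        "f2": ["10.0", "20.0", "30.0", "41.0", "50.0"],
    }
)

# Remove the rows whose class label is missing.
labelled = raw.dropna(subset=["label"])

# Convert the string columns to floats; unparseable entries become NaN.
features = labelled[["f1", "f2"]].apply(pd.to_numeric, errors="coerce")

# n_neighbors=1 mirrors the KNeighborsRegressor(n_neighbors=1) setting.
imputer = KNNImputer(n_neighbors=1)
imputed = imputer.fit_transform(features)
print(imputed)
```

With n_neighbors=1 each missing entry is copied from the single closest row that has that feature, where closeness is measured on the features both rows have in common. Whether this resolves the ValueError depends on what the real data contains, so treat it as a starting point, not a confirmed fix.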