Question

I am trying to preprocess my data by replacing missing values with the mean.

My code is as follows:

#Load the Data 
import numpy as np
data_2 = np.genfromtxt('data.csv', delimiter=',', skip_header=1)

#the missing values in my dataset are identified by value = 0 
#I'm trying to replace the missing values in the third column 
from sklearn.preprocessing import Imputer 
imp = Imputer(missing_values=0, strategy='mean', axis=0)
imp.fit(data_2[:, 2])

It runs, but emits these warnings:

/Users/user1/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)

/Users/user1/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
  DeprecationWarning)

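But my main problem is that it does not fill in the missing data: I printed the data before and after fitting, and nothing changed.

What am I doing wrong?

Update: here are a few rows of my dataset: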
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0

Answer 1

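The first few rows you shared do not contain any null values, so it is hard to explain using them.

Consider this slightly modified version of the dataset, to give you the idea.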

6,148,72,35,0,33.6,0.627,50,1
1,85,,29,0,26.6,0.351,,
,183,64,,0,,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0

With the pandas library, missing values can be filled in easily:

#Load Libraries and data
import pandas as pd
df = pd.read_csv('data.csv',names=[1,2,3,4,5,6,7,8,9])

#Fill the Null values with the mean
df = df.fillna(df.mean())

The names parameter of the read_csv function assigns names to the columns of the CSV file.
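
One detail worth checking: the code in the question uses skip_header=1, which suggests data.csv starts with a header row. When names is passed on its own, read_csv treats the first line as ordinary data. The sketch below illustrates the difference using a small in-memory file; the column labels a, b, c and the variable names are made up for illustration and are not taken from the real data.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for a 'data.csv' that begins with a header row.
csv_with_header = io.StringIO(
    "a,b,c\n"
    "6,148,72\n"
    "1,85,66\n"
)
cols = [1, 2, 3]

# names alone: the header line is read as an ordinary data row.
as_data = pd.read_csv(csv_with_header, names=cols)

# Rewind the buffer, then pass header=0 as well: the header line is
# dropped and the supplied names take its place.
csv_with_header.seek(0)
replaced = pd.read_csv(csv_with_header, names=cols, header=0)

print(as_data.shape)
print(replaced.shape)
```

If the file really has a header row, passing header=0 together with names is probably what you want; otherwise the header text ends up in the data and the columns may no longer be numeric.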

The fillna() function fills the missing values, here with the mean of each column.
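
In the question, missing values are encoded as 0 rather than left empty, so fillna() alone would not touch them. A minimal sketch of one way to handle that, assuming 0 really does mean "missing" in the chosen column: convert the zeros to NaN first, then fill with the mean. The rows are the ones from the question, loaded from an in-memory string instead of data.csv, and column 4 is picked only because it is the one with a single zero in the sample.

```python
import io

import numpy as np
import pandas as pd

# The four sample rows from the question, in place of 'data.csv'.
raw = io.StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
)
df = pd.read_csv(raw, names=[1, 2, 3, 4, 5, 6, 7, 8, 9])

# Assumption for illustration: in column 4, a 0 marks a missing value.
col = 4

# Turn the zeros into NaN so they are excluded from the mean ...
df[col] = df[col].replace(0, np.nan)

# ... then fill them with the mean of the remaining values.
# Series.mean() skips NaN by default.
df[col] = df[col].fillna(df[col].mean())

print(df[col].tolist())
```

As for why the original code showed no change: in scikit-learn, fit only learns the column statistics. The filled values come back from transform or fit_transform as a new array, and the input array is not modified.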
