I'm trying to preprocess my data by replacing missing values with the mean.
My code is as follows:
#Load the Data
import numpy as np
data_2 = np.genfromtxt('data.csv', delimiter=',', skip_header=1)
#the missing values in my dataset are identified by value = 0
#I'm trying to replace the missing values in the third column
from sklearn.preprocessing import Imputer
imp = Imputer(missing_values=0, strategy='mean', axis=0)
imp.fit(data_2[:, 2])
It runs, but emits these warnings:
/Users/user1/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
/Users/user1/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.py:386: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
DeprecationWarning)
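But my main problem is that it doesn't fill in the missing data: I printed the data before and after fitting and nothing changed.
What am I doing wrong?
Update:
Here are a few rows of my dataset: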
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
Answer 0 (score: 1)
Consider this slightly modified version of your dataset, to give you the idea:
6,148,72,35,0,33.6,0.627,50,1
1,85,,29,0,26.6,0.351,,
,183,64,,0,,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
With the pandas library, missing values can be filled easily:
#Load Libraries and data
import pandas as pd
df = pd.read_csv('data.csv',names=[1,2,3,4,5,6,7,8,9])
#Fill the Null values with the mean
df = df.fillna(df.mean())
The fillna() function fills the missing (NaN) values, here with the mean of each column.
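The question marks missing values with 0 rather than leaving the fields empty, and fillna() only touches NaN, so the zeros would first have to be converted. Below is a minimal sketch of that step, not a definitive fix. It assumes the sample rows from the question stand in for data.csv (inlined so it runs on its own) and that there is no header row, as in the snippet above. The column choice `col = 4` is hypothetical: the question asks about the third column, but in the sample only later columns actually contain a 0.

```python
import io

import pandas as pd

# Sample rows from the question, inlined in place of data.csv (assumption).
csv_text = """6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
"""

df = pd.read_csv(io.StringIO(csv_text), header=None,
                 names=[1, 2, 3, 4, 5, 6, 7, 8, 9])

col = 4  # hypothetical choice: the fourth column has a 0 in the sample

# Work on a float copy so NaN can be stored, then turn every 0 into NaN.
values = df[col].astype(float)
values = values.mask(values == 0)

# mean() skips NaN, so the zeros no longer drag the average down.
df[col] = values.fillna(values.mean())

print(df[col].tolist())  # the 0 in row 3 becomes 29.0, the mean of 35, 29, 23
```

Only the chosen column is modified, so zeros that are legitimate values elsewhere (for example the 0/1 labels in the last column) stay as they are.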