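Python: How do I impute missing values in a CSV file?

Time: 2016-03-16 05:24:44

Tags: python csv imputation

I have CSV data that I need to analyze with Python. Some values in the data are missing. A sample of the data is shown below:

Sample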

ID,ID_TYPE,OB_DATE,VERSION_NUM,MET_DOMAIN_NAME,OB_END_CTIME,OB_DAY_CNT,SRC_ID,REC_ST_IND,PRCP_AMT,OB_DAY_CNT_Q,PRCP_AMT_Q,METO_STMP_TIME,MIDAS_STMP_ETIME,PRCP_AMT_J
90, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,24109,1011,0,0,6, 2006-01-17 09:04,0,
150, RAIN, 2006-01-01 00:00,1, DLY3208,900,1,30747,1011,0,0,6, 2006-01-09 13:21,3,
174, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,24775,1011,0.2,0,6, 2006-01-17 09:04,0,
498, RAIN, 2006-01-01 00:00,0, WADRAIN,900,1,1622,1012,0.1,0,1, 2006-01-17 09:04,0,
498, RAIN,,1, WADRAIN,900,31,1622,1022,58.3,0,22576, 2006-03-15 11:41,0,
898, RAIN, 2006-01-01 00:00,0, WADRAIN,900,6,1624,1012,18.5,0,20001,,0,
898, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,1624,1022,0.4,0,2576, 2006-03-15 11:41,0,
996, RAIN, 2006-01-01 00:00,1, WAMRAIN,900,31,24953,1011,53.5,0,6, 2006-01-31 13:51,0,
997, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,24953,1011,1.6,0,6, 2006-02-02 12:28,0,
1045, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,1628,1011,1.1,0,6, 2006-01-17 09:04,0,
1103, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,24772,1011,2.5,0,6, 2006-01-17 09:04,0,
1358, RAIN, 2006-01-01 00:00,0, WADRAIN,900,11,1633,1012,17.7,0,20001,,0,
1358, RAIN,,1, WADRAIN,900,31,1633,1022,42.5,0,22576, 2006-03-15 11:41,0,
1545, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,1636,1011,2,0,6, 2006-01-17 09:04,0,
1584, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,315,1014,2.4,0,2306, 2006-03-15 11:41,0,
1858, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,1645,1011,0.2,0,6, 2006-01-17 09:04,0,
2247, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,24781,1011,0.5,0,6, 2006-01-17 09:04,0,
3066, RAIN,,1, WADRAIN,900,1,1655,1011,0.6,0,6, 2006-02-02 12:28,0,
3067, RAIN, 2006-01-01 00:00,0, WADRAIN,900,7,1655,1012,11,0,20001, 2006-01-26 15:08,0,
3067, RAIN, 2006-01-01 00:00,1, WADRAIN,900,31,1655,1022,57.5,0,22576, 2006-03-15 11:41,0,
3507, RAIN, 2006-01-01 00:00,0, WADRAIN,900,2,1657,1012,15.8,0,20001,,0,
3507, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,1657,1022,0.9,0,2576, 2006-04-13 13:28,0,
4802, RAIN,,0, WADRAIN,900,6,1663,1012,18,0,20001, 2006-01-17 09:04,0,
4802, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,1663,1022,0.9,0,2576, 2006-03-15 11:41,0,
4941, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,1664,1011,0.5,0,6, 2006-01-17 09:04,1,
4942, RAIN, 2006-01-01 00:00,1, WADRAIN,900,1,1664,1011,1.2,0,6, 2006-02-02 12:28,0,

The data has missing values in OB_DATE and METO_STMP_TIME, and I want to impute the missing values in these fields.
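
A quick way to confirm which columns actually contain gaps can be sketched with only the standard library. The SAMPLE string is a cut-down excerpt of the data above (only four columns kept), and count_missing is an illustrative helper name, not part of any library.

```python
import csv
import io
from collections import Counter

# Cut-down excerpt of the sample data (assumption: only four columns kept).
SAMPLE = """ID,OB_DATE,PRCP_AMT,METO_STMP_TIME
90, 2006-01-01 00:00,0, 2006-01-17 09:04
498,,58.3, 2006-03-15 11:41
898, 2006-01-01 00:00,18.5,
1358,,42.5, 2006-03-15 11:41
"""

def count_missing(csv_file):
    """Return {column name: number of empty cells} for columns with gaps."""
    missing = Counter()
    # skipinitialspace drops the blank that follows each comma in this data
    for row in csv.DictReader(csv_file, skipinitialspace=True):
        for name, value in row.items():
            if name is None:
                continue  # extra cells beyond the header; ignored here
            if not (value or "").strip():
                missing[name] += 1
    return dict(missing)

missing_counts = count_missing(io.StringIO(SAMPLE))
print(missing_counts)
```

For a real file, passing open('filename/path', newline='') instead of the StringIO object should work the same way.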

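The basic questions here are:

  1. What is the impact of missing values, and what measures can we take?
  2. I have searched a lot about this, but the concept of imputation is still not clear to me.

    1. How can this be done in Python without using any external library?
    2. A solution that uses an external library is fine too, but can it be achieved without one?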
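
For the no-external-library case, one possible approach is to fill each empty cell with the most frequent value in its column, which also works for text fields such as timestamps. The sketch below assumes that strategy suits the analysis; SAMPLE is a cut-down excerpt of the data, and impute_most_frequent is a hypothetical helper written for illustration.

```python
import csv
import io
from collections import Counter

# Cut-down excerpt of the sample data (assumption: only four columns kept).
SAMPLE = """ID,OB_DATE,PRCP_AMT,METO_STMP_TIME
90, 2006-01-01 00:00,0, 2006-01-17 09:04
498,,58.3, 2006-03-15 11:41
898, 2006-01-01 00:00,18.5,
997, 2006-01-01 00:00,1.6, 2006-01-17 09:04
"""

def impute_most_frequent(rows, columns):
    """Fill empty cells in `columns` with that column's most common value."""
    for col in columns:
        present = [r[col] for r in rows if r[col]]
        if not present:
            continue  # every value is missing; nothing to impute from
        fill = Counter(present).most_common(1)[0][0]
        for r in rows:
            if not r[col]:
                r[col] = fill
    return rows

reader = csv.DictReader(io.StringIO(SAMPLE), skipinitialspace=True)
rows = impute_most_frequent(list(reader), ["OB_DATE", "METO_STMP_TIME"])

# Write the completed rows back out as CSV text
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

The most frequent value is only one choice. Dropping incomplete rows, or copying the value from the previous row, are equally simple to write, and which one is appropriate depends on why the values are missing.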

1 answer:

Answer 0: (score: -1)

I am a beginner, and I hope this helps you!

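import numpy as np
import pandas as pd
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer in sklearn.impute replaces it.
from sklearn.impute import SimpleImputer

# skipinitialspace drops the blank that follows each comma in the sample
dataset = pd.read_csv('filename/path', skipinitialspace=True)

# OB_DATE (column index 2) and METO_STMP_TIME (third from last) hold
# timestamps as text, so 'mean' is undefined; 'most_frequent' works on strings.
cols = ['OB_DATE', 'METO_STMP_TIME']
imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# The imputer needs 2D input, so pass both columns together as a DataFrame
dataset[cols] = imputer.fit_transform(dataset[cols])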
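
Since both columns hold timestamps, another option is to let pandas parse them as datetimes and fill the gaps with pandas alone. This is only a sketch, assuming the most frequent timestamp is an acceptable fill value. SAMPLE is a cut-down excerpt of the question's data, not the full file.

```python
import io

import pandas as pd

# Cut-down excerpt of the sample data (assumption: only four columns kept).
SAMPLE = """ID,OB_DATE,PRCP_AMT,METO_STMP_TIME
90, 2006-01-01 00:00,0, 2006-01-17 09:04
498,,58.3, 2006-03-15 11:41
898, 2006-01-01 00:00,18.5,
997, 2006-01-01 00:00,1.6, 2006-01-17 09:04
"""

date_cols = ["OB_DATE", "METO_STMP_TIME"]

# Empty cells in the parsed columns become NaT (the datetime "missing" marker)
df = pd.read_csv(io.StringIO(SAMPLE), skipinitialspace=True,
                 parse_dates=date_cols)

for col in date_cols:
    # mode() ignores missing values and returns the most frequent timestamp(s)
    most_common = df[col].mode().iloc[0]
    df[col] = df[col].fillna(most_common)

print(df)
```

With real datetime values in place, other fills such as df[col].ffill() (copy the previous row's value) become available as well.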