Converting null values and setting the data type when reading a csv into a DataFrame with Pandas

Date: 2016-09-01 17:32:50

Tags: python pandas

I have a .csv file with GPS data that looks like this:

ID,GPS_LATITUDE,GPS_LONGITUDE
1,35.66727683,139.7591279
2,35.66727683,139.7591279
3,-1,-1
4,35.66750697,139.7589757
5,,139.7589757

The last line has an empty or "null" value. I want to read the data into a DataFrame as float, with the null values set to -1. With my code, the columns end up as strings and the null values are not replaced.

How I am doing it (wrong):

import numpy as np
import pandas as pd

data = r'c:\temp\gps.csv'

def conv(val):
    if val == np.nan:
        return -1
    return val

df = pd.read_csv(data,converters={'GPS_LATITUDE':conv,'GPS_LONGITUDE':conv},dtype={'GPS_LATITUDE':np.float64,'GPS_LONGITUDE':np.float64})

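1 answer:

Answer 0 (score: 1)

First, you don't need a converter function at all. Read the columns as float64 and fill the missing values afterwards: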

$ cat /tmp/a.csv
ID,GPS_LATITUDE,GPS_LONGITUDE
1,35.66727683,139.7591279
2,35.66727683,139.7591279
3,-1,-1
4,35.66750697,139.7589757
5,,139.7589757

In [15]: df = pd.read_csv("/tmp/a.csv", dtype={'GPS_LATITUDE':np.float64,'GPS_LONGITUDE':np.float64})

In [16]: df
Out[16]: 
   ID  GPS_LATITUDE  GPS_LONGITUDE
0   1     35.667277     139.759128
1   2     35.667277     139.759128
2   3     -1.000000      -1.000000
3   4     35.667507     139.758976
4   5           NaN     139.758976

In [18]: df.dtypes
Out[18]: 
ID                 int64
GPS_LATITUDE     float64
GPS_LONGITUDE    float64
dtype: object

In [19]: df.fillna(-1, inplace = True)

In [20]: df
Out[20]: 
   ID  GPS_LATITUDE  GPS_LONGITUDE
0   1     35.667277     139.759128
1   2     35.667277     139.759128
2   3     -1.000000      -1.000000
3   4     35.667507     139.758976
4   5     -1.000000     139.758976
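
As a self-contained sketch of this first approach, the snippet below reads the same sample rows from an in-memory string instead of /tmp/a.csv. The `io.StringIO` buffer and the extra row 6 containing the text `null` are assumptions added for illustration. By default pandas treats both empty fields and the literal `null` as missing, so `fillna(-1)` covers both cases.

```python
import io

import numpy as np
import pandas as pd

# Sample rows from the question, plus a hypothetical row 6 containing the text "null".
csv_text = """ID,GPS_LATITUDE,GPS_LONGITUDE
1,35.66727683,139.7591279
2,35.66727683,139.7591279
3,-1,-1
4,35.66750697,139.7589757
5,,139.7589757
6,null,null
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"GPS_LATITUDE": np.float64, "GPS_LONGITUDE": np.float64},
)

# Empty fields and "null" were parsed as NaN; replace them with -1.
df = df.fillna(-1)

print(df)
print(df.dtypes)
```

Assigning the result of `fillna` instead of using `inplace=True` is only a style choice here; both leave the two GPS columns as float64.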

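Second, if you do want to use conv, change it as shown below. A converter receives each field as a raw string, so an empty field arrives as '' rather than NaN, and val == np.nan is never true because NaN does not compare equal to anything. Converters are applied instead of dtype conversion, so you don't need to specify dtype for columns that have a converter: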

In [21]: def conv(val):
   ....:     if not val:
   ....:         return -1
   ....:     return np.float64(val)
   ....: 

In [22]: df = pd.read_csv("/tmp/a.csv", converters={'GPS_LATITUDE':conv,'GPS_LONGITUDE':conv})

In [23]: df
Out[23]: 
   ID  GPS_LATITUDE  GPS_LONGITUDE
0   1     35.667277     139.759128
1   2     35.667277     139.759128
2   3     -1.000000      -1.000000
3   4     35.667507     139.758976
4   5     -1.000000     139.758976

In [24]: df.dtypes
Out[24]: 
ID                 int64
GPS_LATITUDE     float64
GPS_LONGITUDE    float64
dtype: object
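
The two reasons the original `conv` never fired can be made concrete with a short sketch. The version below also maps the literal text `null` to -1, which the `not val` check alone would not catch. The name `conv_gps`, the `null` handling, and the in-memory sample rows are assumptions for illustration, not part of the original answer.

```python
import io

import numpy as np
import pandas as pd

# NaN is never equal to anything, so `val == np.nan` is always False.
print(np.nan == np.nan)  # False


def conv_gps(val):
    """Hypothetical converter: map empty or 'null' fields to -1.0, else parse as float."""
    text = val.strip()  # converters receive the raw field as a string
    if text == "" or text.lower() == "null":
        return -1.0
    return float(text)


# Rows 4 and 5 are from the question; row 6 is a made-up "null" case.
csv_text = """ID,GPS_LATITUDE,GPS_LONGITUDE
4,35.66750697,139.7589757
5,,139.7589757
6,null,null
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    converters={"GPS_LATITUDE": conv_gps, "GPS_LONGITUDE": conv_gps},
)

print(df)
print(df.dtypes)
```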

In either case, the values come back as numpy.float64:

In [26]: lats = df['GPS_LATITUDE'].tolist()

In [27]: for l in lats:
   ....:     print(l,type(l))
   ....:     
(35.667276829999999, <type 'numpy.float64'>)
(35.667276829999999, <type 'numpy.float64'>)
(-1.0, <type 'numpy.float64'>)
(35.667506969999998, <type 'numpy.float64'>)
(-1.0, <type 'numpy.float64'>)
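
The session output above comes from Python 2, where `print(l, type(l))` prints a tuple. A rough Python 3 equivalent of the same check is sketched below, using a small stand-in Series rather than the real file (an assumption for illustration). The element type returned by `tolist()` can differ between pandas versions, so checking the column's `dtype` is the more reliable test.

```python
import numpy as np
import pandas as pd

# Stand-in for df['GPS_LATITUDE'] after the missing value was replaced with -1.
lats = pd.Series([35.66727683, -1.0, 35.66750697, -1.0], name="GPS_LATITUDE")

# The dtype check does not depend on how tolist() boxes each element.
print(lats.dtype == np.float64)

# numpy.float64 subclasses Python float, so this holds for either element type.
for value in lats.tolist():
    print(value, isinstance(value, float))
```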