How to use pandas to parse a special date format

Time: 2017-08-01 04:04:20

Tags: python pandas

I have xls data with a special date format, for example:

 start day(utc) start time(utc)
    20160401            100
    20160401            200
    20160401            300
    20160401            400
    20160401            500

I want to parse it into the format 2016-04-01 1:00. I read the sheet with pandas:

    parse = lambda x: datetime.strptime(str(x), '%Y%m%d %H')
    content=pd.read_excel(filepath,skiprows=1,
                          na_values=['nan',-9999.0,9999.0,
                          '-9999.0 -',-99,'-99.000 -',-999],
                          parse_cols=[1,2,3,4,5,6,7,8,9,10,11,12,14],
                          header=None, parse_dates = [0,1], 
                          index_col = 0, 
                          date_parser=parse)

But an error occurs. It shows:

 File "D:\Anaconda2\lib\_strptime.py", line 332, in _strptime
  (data_string, format))

   ValueError: time data '100' does not match format '%Y%m%d'

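How can I handle this?

1 answer:

Answer 0: (score: 0)

You can parse only the date column and then add the time with to_timedelta. The time values are stored as 100, 200, ..., so they have to be divided by 100 to get hours: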

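# Note: `parse_cols` was renamed `usecols` in later pandas releases.
df = pd.read_excel(filepath, skiprows=1,
                   na_values=['nan', -9999.0, 9999.0,
                              '-9999.0 -', -99, '-99.000 -', -999],
                   parse_cols=[1,2,3,4,5,6,7,8,9,10,11,12,14],
                   header=None, parse_dates=[0],
                   index_col=0)

# With header=None the columns are numbered, so replace 'start time(utc)'
# below with the actual label of the time column in your frame.
df.index = df.index + pd.to_timedelta(df['start time(utc)'] / 100., unit='h')
df = df.drop('start time(utc)', axis=1)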
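The same idea as a minimal, self-contained sketch. The small frame below is a hypothetical stand-in for the sheet (the column `a` and the three rows are illustrative, not taken from the real file), and the division by 100 assumes the times are always whole hours:

```python
import pandas as pd

# Hypothetical stand-in for the sheet: the index holds the already parsed
# dates, and the time column holds whole hours encoded as 100, 200, 300.
df = pd.DataFrame(
    {"start time(utc)": [100, 200, 300], "a": [1, 7, 7]},
    index=pd.to_datetime(["20160401"] * 3, format="%Y%m%d"),
)

# 100 -> 1.0 hour, 200 -> 2.0 hours, ...
# to_numpy() keeps the addition positional, so the duplicated dates in the
# index cannot cause any label alignment.
hours = df["start time(utc)"].to_numpy() / 100.0
df.index = df.index + pd.to_timedelta(hours, unit="h")

# The time is now part of the index, so the helper column can go.
df = df.drop("start time(utc)", axis=1)

print(df)
```

Note that this only holds for whole hours: a value such as 130 would become 1.3 hours (01:18), not 01:30.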

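If the division is not necessary (the hours are stored as 0, 1, 2, ..., 23), change parse_dates = [0,1] to parse_dates = [[0,1]], so that both columns are combined into a single datetime column before parsing.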
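The difference between the two forms can be seen with plain `datetime.strptime`, without pandas. This is only a sketch of what the parser receives in each case, assuming the two cells are joined with a single space when the nested list is used:

```python
from datetime import datetime

fmt = "%Y%m%d %H"

# parse_dates=[0, 1]: each column is parsed on its own, so the parser
# is handed the time cell alone and cannot match the date part.
try:
    datetime.strptime("100", fmt)
    separate_ok = True
except ValueError:
    separate_ok = False

# parse_dates=[[0, 1]]: the date and hour cells are combined first,
# so the parser sees one string that matches the whole format.
combined = datetime.strptime("20160401 1", fmt)

print(separate_ok)
print(combined)
```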

Example:

from datetime import datetime
from io import StringIO  # pandas.compat.StringIO is not available in newer pandas

import pandas as pd

temp=u"""start day(utc);start time(utc);a
20160401;1;1
20160401;2;7
20160401;3;7
20160401;4;5
20160401;5;3"""
# after testing, replace StringIO(temp) with 'filename.csv'
# the two columns are joined with a space before being passed to the parser
parse = lambda x: datetime.strptime(x, '%Y%m%d %H')
df = pd.read_csv(StringIO(temp), sep=";", 
                          parse_dates = [[0,1]], 
                          index_col = 0,
                          date_parser=parse)

print (df)
                                a
start day(utc)_start time(utc)   
2016-04-01 01:00:00             1
2016-04-01 02:00:00             7
2016-04-01 03:00:00             7
2016-04-01 04:00:00             5
2016-04-01 05:00:00             3

print (df.index)
DatetimeIndex(['2016-04-01 01:00:00', '2016-04-01 02:00:00',
               '2016-04-01 03:00:00', '2016-04-01 04:00:00',
               '2016-04-01 05:00:00'],
              dtype='datetime64[ns]', name='start day(utc)_start time(utc)', freq=None)
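In recent pandas releases `date_parser` is deprecated, so an alternative is to read both columns as plain values and build the index afterwards with `pd.to_datetime`. The following is a minimal sketch rather than the method from the answer. It assumes the time column is encoded as HHMM (100 = 01:00, 230 = 02:30); the sample rows, the column `a`, and the index name `timestamp` are hypothetical:

```python
import pandas as pd

# Hypothetical sample in the layout from the question (day as YYYYMMDD,
# time assumed to be HHMM without leading zeros).
raw = pd.DataFrame({
    "start day(utc)": [20160401, 20160401, 20160401],
    "start time(utc)": [100, 230, 1500],
    "a": [1, 7, 5],
})

# Build one fixed-width string per row: 'YYYYMMDD' + 'HHMM'.
day = raw["start day(utc)"].astype(str)
hhmm = raw["start time(utc)"].astype(str).str.zfill(4)
stamp = pd.to_datetime(day + hhmm, format="%Y%m%d%H%M")

# Drop the two source columns and use the combined timestamps as the index.
df = raw.drop(columns=["start day(utc)", "start time(utc)"])
df.index = pd.DatetimeIndex(stamp, name="timestamp")

print(df)
```

Unlike dividing by 100, padding to four digits also handles times that are not whole hours.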