Unexpected results from pd.read_csv with parse_dates

Time: 2016-12-03 19:38:55

Tags: datetime pandas

I have a csv file that contains a list of holiday dates.

f5_name = r'C:\\holidays.csv'

holidays = pd.read_csv(f5_name, parse_dates=True)

You can reproduce the holidays dataframe from the following:

nerc_holidays.to_dict()

{'dt': {0: '2016-09-05',
  1: '2016-11-24',
  2: '2016-12-26',
  3: '2017-01-02',
  4: '2017-05-29',
  5: '2017-07-04',
  6: '2017-09-04',
  7: '2017-11-23',
  8: '2017-12-15',
  9: '2018-01-01',
  10: '2018-05-28',
  11: '2018-07-04',
  12: '2018-09-03',
  13: '2018-11-22',
  14: '2018-12-25'}}

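As you can see, I added the parse_dates=True argument to the pd.read_csv() call.

Now, I have another dataframe named databasedf. I want to filter databasedf so that only the rows whose date column (dt) appears in the holidays dataframe remain.

When I run the following: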

databasedf[databasedf['dt'].isin(holidays)]

I get this error:

TypeError                                 Traceback (most recent call last)
C:\Users\XXX\Anaconda3\lib\site-packages\pandas\tseries\tools.py in _convert_listlike(arg, box, format, name)
    408             try:
--> 409                 values, tz = tslib.datetime_to_datetime64(arg)
    410                 return DatetimeIndex._simple_new(values, name=name, tz=tz)

pandas\tslib.pyx in pandas.tslib.datetime_to_datetime64 (pandas\tslib.c:29768)()

TypeError: Unrecognized value type: <class 'str'>

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-519-b57c47ecb0e5> in <module>()
----> 1 databasedf[databasedf['dt'].isin(holidays)]

C:\Users\XXX\Anaconda3\lib\site-packages\pandas\core\series.py in isin(self, values)
   2413 
   2414         """
-> 2415         result = algos.isin(_values_from_object(self), values)
   2416         return self._constructor(result, index=self.index).__finalize__(self)
   2417 

C:\Users\XXX\Anaconda3\lib\site-packages\pandas\core\algorithms.py in isin(comps, values)
    129     if com.is_datetime64_dtype(comps):
    130         from pandas.tseries.tools import to_datetime
--> 131         values = to_datetime(values)._values.view('i8')
    132         comps = comps.view('i8')
    133     elif com.is_timedelta64_dtype(comps):

C:\Users\XXX\Anaconda3\lib\site-packages\pandas\util\decorators.py in wrapper(*args, **kwargs)
     89                 else:
     90                     kwargs[new_arg_name] = new_arg_value
---> 91             return func(*args, **kwargs)
     92         return wrapper
     93     return _deprecate_kwarg

C:\Users\XXX\Anaconda3\lib\site-packages\pandas\tseries\tools.py in to_datetime(arg, errors, dayfirst, yearfirst, utc, box, format, exact, coerce, unit, infer_datetime_format)
    289                         yearfirst=yearfirst,
    290                         utc=utc, box=box, format=format, exact=exact,
--> 291                         unit=unit, infer_datetime_format=infer_datetime_format)
    292 
    293 

C:\Users\XXX\Anaconda3\lib\site-packages\pandas\tseries\tools.py in _to_datetime(arg, errors, dayfirst, yearfirst, utc, box, format, exact, unit, freq, infer_datetime_format)
    425         return _convert_listlike(arg, box, format, name=arg.name)
    426     elif com.is_list_like(arg):
--> 427         return _convert_listlike(arg, box, format)
    428 
    429     return _convert_listlike(np.array([arg]), box, format)[0]

C:\Users\XXX\Anaconda3\lib\site-packages\pandas\tseries\tools.py in _convert_listlike(arg, box, format, name)
    410                 return DatetimeIndex._simple_new(values, name=name, tz=tz)
    411             except (ValueError, TypeError):
--> 412                 raise e
    413 
    414     if arg is None:

C:\Users\XXX\Anaconda3\lib\site-packages\pandas\tseries\tools.py in _convert_listlike(arg, box, format, name)
    396                     yearfirst=yearfirst,
    397                     freq=freq,
--> 398                     require_iso8601=require_iso8601
    399                 )
    400 

pandas\tslib.pyx in pandas.tslib.array_to_datetime (pandas\tslib.c:41972)()

pandas\tslib.pyx in pandas.tslib.array_to_datetime (pandas\tslib.c:41577)()

pandas\tslib.pyx in pandas.tslib.array_to_datetime (pandas\tslib.c:41466)()

pandas\tslib.pyx in pandas.tslib.parse_datetime_string (pandas\tslib.c:31806)()

C:\Users\XXX\Anaconda3\lib\site-packages\dateutil\parser.py in parse(timestr, parserinfo, **kwargs)
   1162         return parser(parserinfo).parse(timestr, **kwargs)
   1163     else:
-> 1164         return DEFAULTPARSER.parse(timestr, **kwargs)
   1165 
   1166 

C:\Users\XXX\Anaconda3\lib\site-packages\dateutil\parser.py in parse(self, timestr, default, ignoretz, tzinfos, **kwargs)
    553 
    554         if res is None:
--> 555             raise ValueError("Unknown string format")
    556 
    557         if len(res) == 0:

ValueError: Unknown string format

The .isin() function only works after I do the following:

holidays = pd.to_datetime(holidays['dt'])

Why do I have to coerce the values to datetime when I have already passed the parse_dates=True argument to pd.read_csv()?

1 Answer:

Answer 0 (score: 1)

I think you can also use the index_col parameter with the dt column, together with parse_dates, if you need the output as a DatetimeIndex:

import pandas as pd
# pandas.compat.StringIO is not available in recent pandas versions; use the standard library
from io import StringIO

temp=u"""dt
2016-09-05
2016-11-24
2016-12-26
2017-01-02"""
# after testing, replace StringIO(temp) with f5_name
holidays  = pd.read_csv(StringIO(temp), index_col=['dt'], parse_dates=['dt'])

print (holidays.index)
DatetimeIndex(['2016-09-05', '2016-11-24', '2016-12-26', '2017-01-02'], dtype='datetime64[ns]', name='dt', freq=None)
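As a minimal sketch of how that DatetimeIndex could then be used for the filter from the question: the databasedf below is a made-up stand-in (its rows and the value column are assumptions, not the asker's data), and csv_text is just an illustrative name for the sample CSV.

```python
from io import StringIO

import pandas as pd

# Sample CSV text taken from the answer above
csv_text = """dt
2016-09-05
2016-11-24
2016-12-26
2017-01-02"""

# Parse the dt column as dates and use it as the index
holidays = pd.read_csv(StringIO(csv_text), index_col="dt", parse_dates=["dt"])

# Hypothetical stand-in for the question's databasedf
databasedf = pd.DataFrame({
    "dt": pd.to_datetime(["2016-09-05", "2016-09-06", "2016-11-24", "2016-12-25"]),
    "value": [1, 2, 3, 4],
})

# Keep only the rows whose dt appears in the holiday index
holiday_rows = databasedf[databasedf["dt"].isin(holidays.index)]
print(holiday_rows)
```

Here isin receives datetime values on both sides, so no string parsing is involved in the comparison.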

If you need the output as a list of strings:

import pandas as pd
from io import StringIO

temp=u"""dt
2016-09-05
2016-11-24
2016-12-26
2017-01-02"""
# after testing, replace StringIO(temp) with the filename
holidays  = pd.read_csv(StringIO(temp), index_col=['dt'])

print (holidays.index.tolist())
['2016-09-05', '2016-11-24', '2016-12-26', '2017-01-02']

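Your code also needs holidays['dt'], because you have to select the dt column (the nested dictionary in the to_dict() output) rather than pass the whole dataframe.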
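A small sketch of that column selection, following the workaround from the question (pd.to_datetime on holidays['dt']). The databasedf contents and the names csv_text, holiday_dates and mask are assumptions made for illustration only.

```python
from io import StringIO

import pandas as pd

csv_text = """dt
2016-09-05
2016-11-24
2016-12-26
2017-01-02"""

# Read without date parsing: the dt column holds plain strings
holidays = pd.read_csv(StringIO(csv_text))

# Select the column explicitly and convert it to datetimes
holiday_dates = pd.to_datetime(holidays["dt"])

# Hypothetical stand-in for the question's databasedf
databasedf = pd.DataFrame({
    "dt": pd.to_datetime(["2016-09-05", "2016-09-06", "2016-11-24", "2016-12-25"]),
})

# Boolean mask: True where the date is one of the holidays
mask = databasedf["dt"].isin(holiday_dates)
print(databasedf[mask])
```

Passing the converted Series rather than the whole dataframe means isin compares against the date values themselves.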

parse_dates=True is for parsing the index into a DatetimeIndex (see the docs). But if index_col is not set, it seems to do nothing:

temp=u"""dt
2016-09-05
2016-11-24
2016-12-26
2017-01-02"""
# after testing, replace StringIO(temp) with the filename
holidays  = pd.read_csv(StringIO(temp), parse_dates=True)

print (holidays)

           dt
0  2016-09-05
1  2016-11-24
2  2016-12-26
3  2017-01-02

print (type(holidays.loc[0,'dt']))
<class 'str'>

print (holidays.dt.to_dict())
{0: '2016-09-05', 1: '2016-11-24', 2: '2016-12-26', 3: '2017-01-02'}
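For comparison, here is a hedged sketch that reads the same sample twice, once with parse_dates=True and once with a list of column names, and checks whether the dt column ended up as datetimes. The variable names csv_text, holidays_true and holidays_list are illustrative only.

```python
from io import StringIO

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

csv_text = """dt
2016-09-05
2016-11-24
2016-12-26
2017-01-02"""

# parse_dates=True only attempts to parse the index; no index_col is given here
holidays_true = pd.read_csv(StringIO(csv_text), parse_dates=True)

# Naming the column asks read_csv to parse that column itself
holidays_list = pd.read_csv(StringIO(csv_text), parse_dates=["dt"])

print(is_datetime64_any_dtype(holidays_true["dt"]))
print(is_datetime64_any_dtype(holidays_list["dt"]))
```

With the column parsed at read time, holidays_list['dt'] can be passed to isin directly, without a separate pd.to_datetime step.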