Converting a DataFrame index from string to date

Time: 2018-07-23 21:09:22

Tags: python pandas parsing

I have a large DataFrame (df) that starts out looking like this:

date,number
2015-12-28,161
2015-12-29,225
2015-12-30,197
2016-06-06,217
2016-06-07,301
2016-06-08,317
2016-06-09,338
2016-06-10,308
2016-10-24,108
2016-10-25,142
2016-10-26,162
2016-10-27,165
2016-10-28,141
2016-01-04,193
2016-01-05,249
2016-01-06,263
2016-01-07,266
2016-01-08,248
2017-01-23,121

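It is built by looping over several directories, opening a specific file in each one, and grouping the data in it. Each directory contributes one part of the final df_final DataFrame. This is the code that generates the data: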

def main():
    folder = 'path'
    frames = []
    df_final = pd.DataFrame()

    for dirname, dirs, files in os.walk(folder):
        for filename in files:
            filename_without_extension, extension = os.path.splitext(filename)
            if filename_without_extension == 'portfolio-trade-pos-info':
                df = pd.read_csv(dirname + '/' + filename, index_col='date')

                trades = df.groupby('date')[['trade']].count()
                frames.append(trades)

                df_final = df_final.append(df)
                df_final.index_col = 'date'
                df_final.sort_index()

    final = pd.concat(frames)
    final.sort_values('date')
    final.to_csv('trades-per-day.csv', index=True)

I get this error:

Traceback (most recent call last):
  File "./trades_per_day.py", line 54, in <module>
    main()
  File "./trades_per_day.py", line 33, in main
    trades = df.groupby('date')[['trade']].count()
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/generic.py", line 3991, in groupby
    **kwargs)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/groupby.py", line 1511, in groupby
    return klass(obj, by, **kwds)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/groupby.py", line 370, in __init__
    mutated=self.mutated)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/groupby.py", line 2462, in _get_grouper
    in_axis, name, gpr = True, gpr, obj[gpr]
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/frame.py", line 2059, in __getitem__
    return self._getitem_column(key)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/frame.py", line 2066, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/generic.py", line 1386, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/internals.py", line 3543, in get
    loc = self.items.get_loc(item)
  File "/usr/local/lib64/python2.7/site-packages/pandas/indexes/base.py", line 2136, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'date'

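Is it possible to change the data type of the index of df_final to a date, so that the DataFrame can be sorted in date order?

The output above would then be sorted like this: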

date    number
28/12/2015  161
29/12/2015  225
30/12/2015  197
04/01/2016  193
05/01/2016  249
06/01/2016  263
07/01/2016  266
08/01/2016  248
06/06/2016  217
07/06/2016  301
08/06/2016  317
09/06/2016  338
10/06/2016  308
24/10/2016  108
25/10/2016  142
26/10/2016  162
27/10/2016  165
28/10/2016  141
23/01/2017  121

Thanks

1 Answer:

Answer 0 (score: 1)

Use the parse_dates parameter of pd.read_csv.

MCVE:

import pandas as pd
from io import StringIO

csvfile = StringIO("""date,number
2015-12-28,161
2015-12-29,225
2015-12-30,197
2016-06-06,217
2016-06-07,301
2016-06-08,317
2016-06-09,338
2016-06-10,308
2016-10-24,108
2016-10-25,142
2016-10-26,162
2016-10-27,165
2016-10-28,141
2016-01-04,193
2016-01-05,249
2016-01-06,263
2016-01-07,266
2016-01-08,248
2017-01-23,121""")

df = pd.read_csv(csvfile, parse_dates=['date'])

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 2 columns):
date      19 non-null datetime64[ns]
number    19 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 384.0 bytes
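
To connect this back to the loop in the question, here is a minimal sketch of the same idea with date kept as the index. The two in-memory CSVs and their values are made up for illustration; only the 'trade' column name comes from the question. It assumes the KeyError arises because index_col='date' moves date out of the columns, which older pandas versions do not accept as a groupby key, so the sketch groups on the index level instead. It also shows pd.to_datetime for an index that has already been loaded as strings.

```python
from io import StringIO

import pandas as pd

# Two small in-memory "files" standing in for the per-directory CSVs
# (hypothetical data; only the 'trade' column name is from the question).
part_a = StringIO("""date,trade
2016-06-06,a
2016-06-06,b
2015-12-28,c
""")
part_b = StringIO("""date,trade
2016-01-04,d
2017-01-23,e
2016-01-04,f
""")

frames = []
for csvfile in (part_a, part_b):
    # With index_col set, parse_dates=True parses the index into a DatetimeIndex.
    df = pd.read_csv(csvfile, index_col='date', parse_dates=True)
    # Group on the index level rather than on a column named 'date'.
    frames.append(df.groupby(level='date')[['trade']].count())

# sort_index returns a new frame, so the result has to be assigned.
final = pd.concat(frames).sort_index()
print(final)

# An index that was already read in as strings can be converted afterwards.
as_strings = pd.DataFrame({'number': [217, 161]},
                          index=['2016-06-06', '2015-12-28'])
as_strings.index = pd.to_datetime(as_strings.index)
as_strings = as_strings.sort_index()
print(as_strings)
```

If the day/month/year display from the question is wanted, final.index.strftime('%d/%m/%Y') gives the formatted strings. Apply it only after sorting, since strings in that format no longer sort chronologically.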