I have a large dataframe (df) that initially looks like this:
date,number
2015-12-28,161
2015-12-29,225
2015-12-30,197
2016-06-06,217
2016-06-07,301
2016-06-08,317
2016-06-09,338
2016-06-10,308
2016-10-24,108
2016-10-25,142
2016-10-26,162
2016-10-27,165
2016-10-28,141
2016-01-04,193
2016-01-05,249
2016-01-06,263
2016-01-07,266
2016-01-08,248
2017-01-23,121
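This was produced by looping over a number of directories, opening a specific file in each one, and grouping the data in it. Each directory contributes one part of the final df_final dataframe, using the following code: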
def main():
    folder = 'path'
    frames = []
    df_final = pd.DataFrame()
    for dirname, dirs, files in os.walk(folder):
        for filename in files:
            filename_without_extension, extension = os.path.splitext(filename)
            if filename_without_extension == 'portfolio-trade-pos-info':
                df = pd.read_csv(dirname + '/' + filename, index_col='date')
                trades = df.groupby('date')[['trade']].count()
                frames.append(trades)
                df_final = df_final.append(df)
    df_final.index_col = 'date'
    df_final.sort_index()
    final = pd.concat(frames)
    final.sort_values('date')
    final.to_csv('trades-per-day.csv', index=True)
I get this error:
Traceback (most recent call last):
  File "./trades_per_day.py", line 54, in <module>
    main()
  File "./trades_per_day.py", line 33, in main
    trades = df.groupby('date')[['trade']].count()
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/generic.py", line 3991, in groupby
    **kwargs)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/groupby.py", line 1511, in groupby
    return klass(obj, by, **kwds)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/groupby.py", line 370, in __init__
    mutated=self.mutated)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/groupby.py", line 2462, in _get_grouper
    in_axis, name, gpr = True, gpr, obj[gpr]
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/frame.py", line 2059, in __getitem__
    return self._getitem_column(key)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/frame.py", line 2066, in _getitem_column
    return self._get_item_cache(key)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/generic.py", line 1386, in _get_item_cache
    values = self._data.get(item)
  File "/usr/local/lib64/python2.7/site-packages/pandas/core/internals.py", line 3543, in get
    loc = self.items.get_loc(item)
  File "/usr/local/lib64/python2.7/site-packages/pandas/indexes/base.py", line 2136, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/index.pyx", line 132, in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)
  File "pandas/src/hashtable_class_helper.pxi", line 732, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)
  File "pandas/src/hashtable_class_helper.pxi", line 740, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)
KeyError: 'date'
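Is it possible to change the data type of the index of df_final to date, so that the dataframe can be sorted in date order?
The output above would then be sorted like this: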
date number
28/12/2015 161
29/12/2015 225
30/12/2015 197
04/01/2016 193
05/01/2016 249
06/01/2016 263
07/01/2016 266
08/01/2016 248
06/06/2016 217
07/06/2016 301
08/06/2016 317
09/06/2016 338
10/06/2016 308
24/10/2016 108
25/10/2016 142
26/10/2016 162
27/10/2016 165
28/10/2016 141
23/01/2017 121
Thanks
Answer 0 (score: 1)
Use the parse_dates parameter of pd.read_csv.
MCVE:
from io import StringIO
import pandas as pd
csvfile = StringIO("""date,number
2015-12-28,161
2015-12-29,225
2015-12-30,197
2016-06-06,217
2016-06-07,301
2016-06-08,317
2016-06-09,338
2016-06-10,308
2016-10-24,108
2016-10-25,142
2016-10-26,162
2016-10-27,165
2016-10-28,141
2016-01-04,193
2016-01-05,249
2016-01-06,263
2016-01-07,266
2016-01-08,248
2017-01-23,121""")
df = pd.read_csv(csvfile, parse_dates=['date'])
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 2 columns):
date      19 non-null datetime64[ns]
number    19 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 384.0 bytes
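Once the dates are parsed, sorting is chronological rather than alphabetical. The following is a minimal sketch of how the loop from the question might look with that change, not a drop-in replacement. The two in-memory CSVs and their contents are hypothetical stand-ins for the per-directory files, and the 'trade' column is assumed from the question's code. The KeyError appears to come from index_col='date': it moves 'date' into the index, and the pandas version in the traceback only looks for a column of that name in groupby. Reading 'date' as an ordinary column avoids this.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-ins for the per-directory CSV files; the real file
# contents are not known, only the 'date' and 'trade' column names are
# taken from the question.
csv_a = StringIO("date,trade\n2016-06-06,a\n2016-06-06,b\n2016-06-07,c\n")
csv_b = StringIO("date,trade\n2015-12-28,d\n2016-01-04,e\n2016-01-04,f\n")

frames = []
for handle in (csv_a, csv_b):
    # parse_dates converts 'date' to datetime64; keeping it as a regular
    # column (no index_col) lets groupby('date') find it by name.
    df = pd.read_csv(handle, parse_dates=['date'])
    frames.append(df.groupby('date')[['trade']].count())

# sort_index returns a new frame instead of sorting in place,
# so the result has to be assigned.
final = pd.concat(frames).sort_index()

# date_format controls how the datetime index is written out;
# calling to_csv without a path returns the CSV text as a string.
csv_text = final.to_csv(date_format='%d/%m/%Y')
print(csv_text)
```

With this sample data the rows come out in date order, starting at 28/12/2015, even though the second file holds the earlier dates. Note that the question's calls to sort_index() and sort_values('date') discard their results; assigning the return value, as above, is what makes the sort stick.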