I have a simplified CSV file that looks like this:
X,,Y,,Z,
Date,Time,A,B,A,B
2017-01-21,01:57:49.390,0,1,2,3
2017-01-21,01:57:50.400,4,5,7,9
2017-01-21,01:57:51.410,3,2,4,1
The first two columns are the date and the time. When I run
pandas.read_csv('foo.csv', header=[0,1])
I get the following DataFrame:
X Unnamed: 1_level_0 Y Unnamed: 3_level_0 Z Unnamed: 5_level_0
Date Time A B A B
0 2017-01-21 01:57:49.390 0 1 2 3
1 2017-01-21 01:57:50.400 4 5 7 9
2 2017-01-21 01:57:51.410 3 2 4 1
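Ignoring the annoying "Unnamed" entries in the columns for now, I want to merge the first two columns into a single datetime. So I tried the parse_dates argument: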
pandas.read_csv('foo.csv', header=[0,1], parse_dates={'datetime': [0,1]})
But all I get from that is a traceback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 401, in _read
data = parser.read()
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 939, in read
ret = self._engine.read(nrows)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1585, in read
names, data = self._do_date_conversions(names, data)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1364, in _do_date_conversions
self.index_names, names, keep_date_col=self.keep_date_col)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 2737, in _process_date_conversion
data_dict.pop(c)
KeyError: "('X', 'Date')"
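I'm not sure why it raises a KeyError on ('X', 'Date'), since that label is definitely present in the columns. I really don't know whether this is a bug in pandas that I should report (I'm using 0.19.2), or whether I'm just not understanding something. Any ideas?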
Answer 0: (score: 1)
If you need to, you can always work around it:
import datetime as dt
import pandas as pd
# read in the csv file
df = pd.read_csv('foo.csv', header=[0, 1])
# get a label for the funky column names
date_label, time_label = tuple(df.columns.values)[0:2]
# merge the columns into a single datetime
dates = [
    dt.datetime.strptime('T'.join(ts) + '000', '%Y-%m-%dT%H:%M:%S.%f')
    for ts in zip(df[date_label], df[time_label])]
# save the new column
df['DateTime'] = pd.Series(dates).values
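The same workaround can probably be written without the explicit Python loop by joining the two columns as strings and passing the result to pd.to_datetime. This is only a sketch: it reads the sample data from an in-memory string instead of foo.csv, and the names CSV_TEXT and timestamps are made up for illustration.

```python
import io

import pandas as pd

# Stand-in for foo.csv, copied from the question (illustrative name).
CSV_TEXT = """X,,Y,,Z,
Date,Time,A,B,A,B
2017-01-21,01:57:49.390,0,1,2,3
2017-01-21,01:57:50.400,4,5,7,9
2017-01-21,01:57:51.410,3,2,4,1
"""

# Read with the two header rows, exactly as in the question.
df = pd.read_csv(io.StringIO(CSV_TEXT), header=[0, 1])

# The first two column labels are tuples such as ('X', 'Date').
date_label, time_label = df.columns[:2]

# Join date and time as text, then parse the whole column in one call.
combined = df[date_label].astype(str) + ' ' + df[time_label].astype(str)
timestamps = pd.to_datetime(combined, format='%Y-%m-%d %H:%M:%S.%f')

# Store the parsed values as a new column.
df['DateTime'] = timestamps
print(timestamps)
```

Giving an explicit format avoids relying on format inference, and %f accepts the three-digit milliseconds as they are, so no zero padding is needed here.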
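Update:
I have filed a bug and a pull request for this issue. In the response to the bug, jreback (pandas lead maintainer) gave a fairly detailed reply about the problems with the multi-level header in this example. I think you are already aware of those problems, but you may want to read what he wrote. At the end of his reply he offers a workaround:
Making a single level in a multi-level frame is not useful. I would probably do this: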
In [25]: pandas.read_csv(StringIO(data), header=0, skiprows=1, parse_dates={'datetime':[0,1]})
Out[25]:
datetime A B A.1 B.1
0 2017-01-21 01:57:49.390 0 1 2 3
1 2017-01-21 01:57:50.400 4 5 7 9
2 2017-01-21 01:57:51.410 3 2 4 1
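For anyone who wants to try the flattened-header approach end to end, here is a self-contained sketch. It makes two assumptions that are not in the reply above: the data comes from an in-memory string (the hypothetical name CSV_TEXT) rather than foo.csv, and the two columns are combined after reading rather than through parse_dates, because, as far as I know, newer pandas releases have deprecated combining columns via parse_dates.

```python
import io

import pandas as pd

# Stand-in for foo.csv, copied from the question (illustrative name).
CSV_TEXT = """X,,Y,,Z,
Date,Time,A,B,A,B
2017-01-21,01:57:49.390,0,1,2,3
2017-01-21,01:57:50.400,4,5,7,9
2017-01-21,01:57:51.410,3,2,4,1
"""

# Skip the first header row (X,,Y,,Z,) and use the second one as a flat header.
flat = pd.read_csv(io.StringIO(CSV_TEXT), header=0, skiprows=1)

# Remove the Date and Time columns and join them as text.
combined = flat.pop('Date').astype(str) + ' ' + flat.pop('Time').astype(str)

# Parse the joined text and put the result in front as 'datetime'.
flat.insert(0, 'datetime',
            pd.to_datetime(combined, format='%Y-%m-%d %H:%M:%S.%f'))
print(flat)
```

The trade-off is the one visible in the output above: the X/Y/Z grouping is lost, and the duplicate names come back as A.1 and B.1.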