Strange behavior when specifying date limits with pandas

Time: 2015-05-21 12:18:36

Tags: python pandas

I have a pandas.DataFrame object, indexed by datetime, obtained via pandas.read_csv. The data has a frequency of 10 minutes.

I want to select the period from 2014-06-15 00:00:00 to 2014-07-01 00:00:00. When I write

a=df["2014-06-15 00:00:00":"2014-07-01 00:00:00"]

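the data actually starts at 2014-06-15 00:10:00 rather than 2014-06-15 00:00:00. However, if I write

a=df["2014-06-15 00:00":"2014-07-01 00:00"]

(omitting the seconds), then I get the expected behavior, i.e. data starting at 2014-06-15 00:00:00. Am I missing something? I am using pandas version 0.16.0.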

Edit

MWE data:

a,b,c
2014-06-14 23:10,       3.809,  103.0
2014-06-14 23:20,       2.935,  83.0
2014-06-14 23:30,       1.923,  73.0
2014-06-14 23:40,       2.843,  89.0
2014-06-14 23:50,       1.785,  125.0
2014-06-15 00:00,       2.383,  114.0
2014-06-15 00:10,       3.717,  94.0
2014-06-15 00:20,       5.005,  91.0
2014-06-15 00:30,       3.901,  97.0
2014-06-15 00:40,       3.395,  98.0
2014-06-15 00:50,       1.095,  36.0
2014-06-15 01:00,       2.383,  67.0
2014-06-15 01:10,       2.199,  98.0
2014-06-15 01:20,       3.533,  82.0
2014-06-15 01:30,       1.969,  81.0
2014-06-15 01:40,       2.705,  78.0
2014-06-15 01:50,       3.579,  52.0
2014-06-15 02:00,       2.613,  81.0
2014-06-15 02:10,       3.671,  71.0
2014-06-15 02:20,       4.591,  94.0
2014-06-15 02:30,       4.499,  84.0
2014-06-15 02:40,       2.383,  26.0
2014-06-15 02:50,       1.555,  86.0
2014-06-15 03:00,       2.061,  179.0
2014-06-15 03:10,       1.693,  299.0
2014-06-15 03:20,       2.705,  114.0
2014-06-15 03:30,       1.647,  104.0
2014-06-15 03:40,       3.027,  306.0

MWE code:

import pandas as pd
df=pd.read_csv("mwe.csv", index_col=0)
a=df["2014-06-15 00:00:00":]
print(a)

PS: I just found out that this code does not work under pandas 0.14.

1 answer:

Answer 0 (score: 1)

When the CSV is parsed like this (without specifying the parse_dates parameter):

df = pd.read_csv("mwe.csv", index_col=0)

no attempt is made to parse the strings as dates. The Index therefore has dtype object, and the values in the index are strings:

In [45]: df.index
Out[45]: Index([u'2014-06-14 23:10', u'2014-06-14 23:20', u'2014-06-14 23:30', u'2014-06-14 23:40', u'2014-06-14 23:50', u'2014-06-15 00:00', u'2014-06-15 00:10', u'2014-06-15 00:20', u'2014-06-15 00:30', u'2014-06-15 00:40', u'2014-06-15 00:50', u'2014-06-15 01:00', u'2014-06-15 01:10', u'2014-06-15 01:20', u'2014-06-15 01:30', u'2014-06-15 01:40', u'2014-06-15 01:50', u'2014-06-15 02:00', u'2014-06-15 02:10', u'2014-06-15 02:20', u'2014-06-15 02:30', u'2014-06-15 02:40', u'2014-06-15 02:50', u'2014-06-15 03:00', u'2014-06-15 03:10', u'2014-06-15 03:20', u'2014-06-15 03:30', u'2014-06-15 03:40'], dtype='object')
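A quick way to confirm this diagnosis is to check the index type before slicing. The following is only a sketch, assuming Python 3 and a reasonably recent pandas; `csv_text` and the variable names are illustrative and hold just a few rows of the MWE data in memory.

```python
import io
import pandas as pd

# A few rows of the MWE data, held in memory so the sketch is self-contained.
csv_text = """a,b,c
2014-06-14 23:50,       1.785,  125.0
2014-06-15 00:00,       2.383,  114.0
2014-06-15 00:10,       3.717,  94.0
"""

# Same call as in the question: no parse_dates, so nothing is converted.
df = pd.read_csv(io.StringIO(csv_text), index_col=0)

# If the index had been parsed as dates, it would be a DatetimeIndex.
index_is_datetime = isinstance(df.index, pd.DatetimeIndex)
# Instead, each label is still a plain string.
first_label_is_str = isinstance(df.index[0], str)

print(type(df.index).__name__, df.index.dtype)
print(index_is_datetime, first_label_is_str)
```

If `index_is_datetime` is False, any slice bounds passed to the frame are compared as strings rather than as dates.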

The string "2014-06-15 00:00:00" falls between u'2014-06-15 00:00' and u'2014-06-15 00:10', because strings are ordered lexicographically, and u < v if u is a prefix of v:

In [49]: u'2014-06-15 00:00' < u"2014-06-15 00:00:00" < u'2014-06-15 00:10'
Out[49]: True

(Internally, the strings are converted to unicode before being compared.)
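To see how this ordering moves the start of the slice, the lookup can be modeled as a binary search over the sorted string labels. This is a simplified sketch using the standard-library bisect module, not a description of pandas internals; the helper `first_label_at_or_after` is a hypothetical name introduced here for illustration.

```python
import bisect

# Labels as they appear in the unparsed index: plain strings, already sorted.
labels = [
    "2014-06-14 23:40",
    "2014-06-14 23:50",
    "2014-06-15 00:00",
    "2014-06-15 00:10",
    "2014-06-15 00:20",
]

def first_label_at_or_after(sorted_labels, start):
    """Return the first label that compares >= start as a string, or None."""
    pos = bisect.bisect_left(sorted_labels, start)
    return sorted_labels[pos] if pos < len(sorted_labels) else None

# With seconds, the bound sorts after its own prefix "2014-06-15 00:00".
with_seconds = first_label_at_or_after(labels, "2014-06-15 00:00:00")
# Without seconds, the bound is exactly equal to an existing label.
without_seconds = first_label_at_or_after(labels, "2014-06-15 00:00")

print(with_seconds)     # 2014-06-15 00:10
print(without_seconds)  # 2014-06-15 00:00
```

The bound with seconds lands on the 00:10 row, which matches the behavior reported in the question.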

The fix is to parse the date-like strings into actual dates, either after reading the file:

df = pd.read_csv("mwe.csv", index_col=0)
df.index = pd.DatetimeIndex(df.index)

or directly in the read_csv call:

df = pd.read_csv("mwe.csv", index_col=0, parse_dates=[0])

Then both df["2014-06-15 00:00:00":] and df["2014-06-15 00:00":] return the expected result:

In [57]: df["2014-06-15 00:00:00":].index[0]
Out[57]: Timestamp('2014-06-15 00:00:00')

In [58]: df["2014-06-15 00:00":].index[0]
Out[58]: Timestamp('2014-06-15 00:00:00')
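For a self-contained check of the fix, here is a hedged sketch assuming Python 3 and a recent pandas. It reads a few MWE rows from an in-memory string (`csv_text` is illustrative) and uses `.loc` for the label slices, which is my choice here rather than the plain `df[...]` form used above.

```python
import io
import pandas as pd

# A few rows of the MWE data, held in memory so the sketch is self-contained.
csv_text = """a,b,c
2014-06-14 23:50,       1.785,  125.0
2014-06-15 00:00,       2.383,  114.0
2014-06-15 00:10,       3.717,  94.0
2014-06-15 00:20,       5.005,  91.0
"""

# parse_dates=[0] converts the index column into a DatetimeIndex.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=[0])

# Both spellings of the lower bound are now compared as dates.
start_with_seconds = df.loc["2014-06-15 00:00:00":].index[0]
start_without_seconds = df.loc["2014-06-15 00:00":].index[0]

# Label slices include both endpoints, so this window holds 00:00 and 00:10.
window = df.loc["2014-06-15 00:00:00":"2014-06-15 00:10:00"]

print(start_with_seconds, start_without_seconds, len(window))
```

Because label-based slices include both endpoints, a range such as the one in the question would also contain the row at the upper bound if it exists.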