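I am seeing strange behaviour from index.set_levels in pandas 0.15.0. When I localize the dates to Europe/Paris with ambiguous='infer', 02:00 on 26 October shows up twice with the same offset.
Does anyone have a fix for this?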
import numpy as np
import pandas as pd

array = [('s001', d) for d in pd.date_range(start='01/01/2014', end='01/01/2015', freq='H')] + [('s002', d) for d in pd.date_range(start='01/01/2014', end='01/01/2015', freq='H')]
index = pd.MultiIndex.from_tuples(array, names=['sce', 'DATES'])
df = pd.DataFrame(np.random.randn(len(index)), index=index)
df = df.append(df.query('DATES == "26/10/2014 02:00:00"'))
df = df.query('DATES <> "30/03/2014 02:00:00"')
df = df.sort()
df[7151:7160]
The output is:
sce   DATES
s001  2014-10-26 00:00:00    0.342909
      2014-10-26 01:00:00   -0.575897
      2014-10-26 02:00:00   -1.469307   <<<< ok
      2014-10-26 02:00:00   -1.469307   <<<< ok
      2014-10-26 03:00:00   -1.277365
      2014-10-26 04:00:00    1.252814
Then:
df.index = df.index.set_levels(df.index.get_level_values(1).tz_localize('Europe/Paris', ambiguous = 'infer'), level=1)
df[7151:7160]
The output is:
sce   DATES
s001  2014-10-26 01:00:00+02:00    0.342909
      2014-10-26 02:00:00+02:00   -0.575897   <<<< nok
      2014-10-26 02:00:00+01:00   -1.469307   <<<< ok
      2014-10-26 02:00:00+01:00   -1.469307   <<<< nok
      2014-10-26 03:00:00+01:00   -1.277365
      2014-10-26 04:00:00+01:00    1.252814
Then, if I go through a simple (single-level) index instead:
df = df.reset_index('sce')
df = df.tz_localize('Europe/Paris', ambiguous = 'infer')
df = df.set_index('sce', append=True)
df[7151:7160]
The output is:
DATES sce
2014-10-26 00:00:00+02:00 s001 0.342909 <<<< ok
2014-10-26 01:00:00+02:00 s001 -0.575897 <<<< ok
2014-10-26 02:00:00+02:00 s001 -1.469307 <<<< ok
2014-10-26 02:00:00+01:00 s001 -1.469307 <<<< ok
2014-10-26 03:00:00+01:00 s001 -1.277365 <<<< ok
2014-10-26 04:00:00+01:00 s001 1.252814 <<<< ok
The second method gives the correct result, but it is very slow on a large MultiIndex DataFrame (16,000 dates by 200 scenarios).
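Answer 0 (score: 1)
This is a bug; see the issue here.
The following works as a workaround. I think this is probably a bug, since the levels themselves do not infer the ambiguous time zone correctly.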
In [91]: def works(df):
   ....:     return df.reset_index(level=1,drop=True).set_index(df.index.get_level_values(1).tz_localize('Europe/Paris', ambiguous = 'infer'),append=True).iloc[7151:7160]
   ....:
In [92]: def breaks(df):
   ....:     return df.set_index(df.index.set_levels(df.index.get_level_values(1).tz_localize('Europe/Paris', ambiguous = 'infer'),level=1)).iloc[7151:7160]
   ....:
In [93]: array = [('s001', d) for d in pd.date_range(start='01/01/2014', end='01/01/2015', freq='H')] + [('s002', d) for d in pd.date_range(start='01/01/2014', end='01/01/2015', freq='H')]
In [94]: index = pd.MultiIndex.from_tuples(array, names=['sce', 'DATES'])
In [95]: df = pd.DataFrame(np.random.randn(len(index)), index=index)
In [96]: df = df.append(df.query('DATES == "26/10/2014 02:00:00"'))
In [97]: df = df.query('DATES <> "30/03/2014 02:00:00"')
In [98]: df = df.sort()
In [99]: works(df)
Out[99]:
                                        0
sce  DATES
s001 2014-10-26 00:00:00+02:00  -0.833819
     2014-10-26 01:00:00+02:00  -1.190427
     2014-10-26 02:00:00+02:00  -1.210017
     2014-10-26 02:00:00+01:00  -1.210017
     2014-10-26 03:00:00+01:00   0.763599
     2014-10-26 04:00:00+01:00  -1.055695
     2014-10-26 05:00:00+01:00  -0.912766
     2014-10-26 06:00:00+01:00   0.373625
     2014-10-26 07:00:00+01:00   0.631287
In [100]: breaks(df)
Out[100]:
                                        0
sce  DATES
s001 2014-10-26 01:00:00+02:00  -0.833819
     2014-10-26 02:00:00+02:00  -1.190427
     2014-10-26 02:00:00+01:00  -1.210017
     2014-10-26 02:00:00+01:00  -1.210017
     2014-10-26 03:00:00+01:00   0.763599
     2014-10-26 04:00:00+01:00  -1.055695
     2014-10-26 05:00:00+01:00  -0.912766
     2014-10-26 06:00:00+01:00   0.373625
     2014-10-26 07:00:00+01:00   0.631287
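For readers on a recent pandas, where DataFrame.append and DataFrame.sort are no longer available, the same idea as works() can be sketched roughly as below. This is only an illustration under stated assumptions, not the fix adopted by pandas: localize_level is a hypothetical helper name, the "value" column is made up, and the small frame stands in for the full-year data from the question.

```python
import numpy as np
import pandas as pd


def localize_level(df, level, tz):
    """Return a copy of df with index level `level` localized to `tz`.

    Hypothetical helper: it rebuilds the MultiIndex from per-row arrays
    instead of calling set_levels.
    """
    idx = df.index
    pos = list(idx.names).index(level)
    # One array per level, each holding one entry per row.
    arrays = [idx.get_level_values(i) for i in range(idx.nlevels)]
    # 'infer' resolves a repeated wall-clock hour from the row order.
    arrays[pos] = arrays[pos].tz_localize(tz, ambiguous="infer")
    out = df.copy()
    out.index = pd.MultiIndex.from_arrays(arrays, names=idx.names)
    return out


# Small frame around the 2014-10-26 change; 02:00 is repeated per scenario.
naive = pd.to_datetime([
    "2014-10-26 00:00", "2014-10-26 01:00", "2014-10-26 02:00",
    "2014-10-26 02:00", "2014-10-26 03:00",
])
index = pd.MultiIndex.from_arrays(
    [["s001"] * 5 + ["s002"] * 5, list(naive) * 2],
    names=["sce", "DATES"],
)
df = pd.DataFrame({"value": np.arange(10.0)}, index=index)

localized = localize_level(df, "DATES", "Europe/Paris")
print(localized)
```

Rebuilding the index with MultiIndex.from_arrays avoids set_levels altogether. set_levels replaces the stored level values that the row codes point into, whereas get_level_values returns one entry per row, so the two arrays generally differ in length and order.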