Handling DST times with the set_levels function [multiindex pandas]

Asked: 2014-10-30 18:08:33

Tags: python pandas multi-index

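I am seeing strange behavior from index.set_levels in pandas 0.15.0. When I localize the dates to Europe/Paris with ambiguous='infer', the 2 a.m. hour at the October DST change (2014-10-26) comes out doubled with the same offset.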

Has anyone solved this problem?

import numpy as np
import pandas as pd

# Hourly naive timestamps for 2014, repeated for two scenarios.
dates = pd.date_range(start='01/01/2014', end='01/01/2015', freq='H')
array = [('s001', d) for d in dates] + [('s002', d) for d in dates]
index = pd.MultiIndex.from_tuples(array, names=['sce', 'DATES'])
df = pd.DataFrame(np.random.randn(len(index)), index=index)
# Duplicate the ambiguous hour (end of DST) and drop the nonexistent one (start of DST).
df = pd.concat([df, df.query('DATES == "26/10/2014 02:00:00"')])
df = df.query('DATES != "30/03/2014 02:00:00"')
df = df.sort_index()
df[7151:7160]

The output is:

sce  DATES                        
s001 2014-10-26 00:00:00  0.342909
     2014-10-26 01:00:00 -0.575897
     2014-10-26 02:00:00 -1.469307   <<<< ok
     2014-10-26 02:00:00 -1.469307   <<<< ok
     2014-10-26 03:00:00 -1.277365
     2014-10-26 04:00:00  1.252814

Then:

df.index  = df.index.set_levels(df.index.get_level_values(1).tz_localize('Europe/Paris', ambiguous = 'infer'), level=1)
df[7151:7160]

The output is:

sce  DATES                              
s001 2014-10-26 01:00:00+02:00  0.342909
     2014-10-26 02:00:00+02:00 -0.575897   <<<< nok   
     2014-10-26 02:00:00+01:00 -1.469307   <<<< ok
     2014-10-26 02:00:00+01:00 -1.469307   <<<< nok
     2014-10-26 03:00:00+01:00 -1.277365
     2014-10-26 04:00:00+01:00  1.252814

Then, if I go through a simple (single-level) index instead:

df = df.reset_index('sce')
df = df.tz_localize('Europe/Paris', ambiguous = 'infer')
df = df.set_index('sce', append=True)
df[7151:7160]

The output is:

DATES                     sce           
2014-10-26 00:00:00+02:00 s001  0.342909   <<<< ok
2014-10-26 01:00:00+02:00 s001 -0.575897   <<<< ok
2014-10-26 02:00:00+02:00 s001 -1.469307   <<<< ok
2014-10-26 02:00:00+01:00 s001 -1.469307   <<<< ok
2014-10-26 03:00:00+01:00 s001 -1.277365   <<<< ok
2014-10-26 04:00:00+01:00 s001  1.252814   <<<< ok

The second method gives the correct result, but it is very slow for large multi-indexed DataFrames (16,000 dates and 200 scenarios).
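
For context on why the correct output has +02:00 on the first 02:00 row and +01:00 on the second: on 2014-10-26 Europe/Paris went from UTC+02:00 back to UTC+01:00, so the 02:00 wall-clock hour occurred twice, and only the order of the rows tells the two occurrences apart. The sketch below is a simplified, standard-library stand-in for that order-based inference, not pandas' actual implementation. `infer_offsets` is a hypothetical helper, and the two fixed offsets are hard-coded assumptions that only hold around this one switch.

```python
from datetime import datetime, timedelta, timezone

CEST = timezone(timedelta(hours=2), "CEST")  # summer time, before the switch
CET = timezone(timedelta(hours=1), "CET")    # standard time, after the switch


def infer_offsets(wall_times):
    """Attach CEST until a wall-clock time fails to increase, then CET.

    Hypothetical helper: only valid for a sorted run of naive times that
    crosses a single autumn DST switch.
    """
    result = []
    tz = CEST
    previous = None
    for t in wall_times:
        if previous is not None and t <= previous:
            tz = CET  # the clock went back, so the rest is standard time
        result.append(t.replace(tzinfo=tz))
        previous = t
    return result


# Same six wall-clock hours as in the output above, with 02:00 repeated.
naive = [datetime(2014, 10, 26, h) for h in (0, 1, 2, 2, 3, 4)]
localized = infer_offsets(naive)
for t in localized:
    print(t.isoformat())
```

If the repeated 02:00 entry is missing from the input, nothing in the sequence signals the switch, which is why the duplicated row matters for this kind of inference.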

1 Answer:

Answer 0: (score: 1)

This is a bug; see the issue here

The following works as a workaround. I think this may be a bug, because the levels themselves do not correctly infer the ambiguous times.

In [91]: def works(df):
   ....:     return df.reset_index(level=1,drop=True).set_index(df.index.get_level_values(1).tz_localize('Europe/Paris', ambiguous = 'infer'),append=True).iloc[7151:7160]
   ....: 

In [92]: def breaks(df):
   ....:     return df.set_index(df.index.set_levels(df.index.get_level_values(1).tz_localize('Europe/Paris', ambiguous = 'infer'),level=1)).iloc[7151:7160]
   ....: 

In [93]: array = [('s001', d) for d in pd.date_range(start='01/01/2014', end='01/01/2015', freq='H')] + [('s002', d) for d in pd.date_range(start='01/01/2014', end='01/01/2015', freq='H')]

In [94]: index = pd.MultiIndex.from_tuples(array, names=['sce', 'DATES'])

In [95]: df = pd.DataFrame(np.random.randn(len(index)), index=index)

In [96]: df = df.append(df.query('DATES == "26/10/2014 02:00:00"'))

In [97]: df = df.query('DATES <> "30/03/2014 02:00:00"')

In [98]: df = df.sort()

In [99]: works(df)
Out[99]: 
                                       0
sce  DATES                              
s001 2014-10-26 00:00:00+02:00 -0.833819
     2014-10-26 01:00:00+02:00 -1.190427
     2014-10-26 02:00:00+02:00 -1.210017
     2014-10-26 02:00:00+01:00 -1.210017
     2014-10-26 03:00:00+01:00  0.763599
     2014-10-26 04:00:00+01:00 -1.055695
     2014-10-26 05:00:00+01:00 -0.912766
     2014-10-26 06:00:00+01:00  0.373625
     2014-10-26 07:00:00+01:00  0.631287

In [100]: breaks(df)
Out[100]: 
                                       0
sce  DATES                              
s001 2014-10-26 01:00:00+02:00 -0.833819
     2014-10-26 02:00:00+02:00 -1.190427
     2014-10-26 02:00:00+01:00 -1.210017
     2014-10-26 02:00:00+01:00 -1.210017
     2014-10-26 03:00:00+01:00  0.763599
     2014-10-26 04:00:00+01:00 -1.055695
     2014-10-26 05:00:00+01:00 -0.912766
     2014-10-26 06:00:00+01:00  0.373625
     2014-10-26 07:00:00+01:00  0.631287
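
The session above relies on pandas 0.15-era calls (`df.append`, `df.sort()`, and `<>` inside `query`) that newer releases no longer accept. As a hedged sketch of the same idea as `works`, assuming a recent pandas, the full-length level values can be localized and the MultiIndex rebuilt from them. `localize_level` is a hypothetical helper name, not a pandas API, and the small two-day data set is made up for illustration.

```python
import numpy as np
import pandas as pd


def localize_level(index, level, tz):
    """Return a copy of `index` with one datetime level localized to `tz`.

    Hypothetical helper: it localizes the full-length level values (which
    still contain the repeated hour) and rebuilds the MultiIndex from them.
    """
    pos = index.names.index(level)
    arrays = [index.get_level_values(i) for i in range(index.nlevels)]
    arrays[pos] = arrays[pos].tz_localize(tz, ambiguous="infer")
    return pd.MultiIndex.from_arrays(arrays, names=index.names)


# Small stand-in for the question's data: hourly wall-clock times around the
# 2014-10-26 switch, with 02:00 appearing twice, for two scenarios.
aware = pd.date_range("2014-10-25", "2014-10-27", freq=pd.Timedelta(hours=1),
                      tz="Europe/Paris")
naive = aware.tz_localize(None)  # drop the tz, keep local wall-clock times
index = pd.MultiIndex.from_arrays(
    [np.repeat(["s001", "s002"], len(naive)), naive.append(naive)],
    names=["sce", "DATES"],
)
df = pd.DataFrame({"value": np.arange(len(index), dtype=float)}, index=index)

df.index = localize_level(df.index, "DATES", "Europe/Paris")
print(df.loc["s001"].iloc[24:30])
```

Because the whole level is localized in one vectorized call instead of going through `reset_index` and `set_index` on the frame, this may also help with the speed concern raised in the question, though that has not been measured here.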