Pandas MultiIndex: performance problem when adding a new index level

Time: 2019-12-03 18:51:24

Tags: python pandas

I have a DataFrame like this:

                           Band1       lat1       lon1
latitude level longitude                              
41.0     1000  19.50        23.0  41.015335  19.548331
               19.50        44.0  41.015335  19.565497
               19.50        12.0  41.015335  19.582663
               19.75        35.0  41.015335  19.668494
               19.75        83.0  41.015335  19.685660

I want to add the following values as a new level of the MultiIndex (they are a DatetimeIndex):

DatetimeIndex(['1979-01-01 00:00:00', '1979-01-01 01:00:00',
               '1979-01-01 02:00:00', '1979-01-01 03:00:00',
               '1979-01-01 04:00:00', '1979-01-01 05:00:00',
               '1979-01-01 06:00:00', '1979-01-01 07:00:00',
               '1979-01-01 08:00:00', '1979-01-01 09:00:00',
               ...
               '2019-12-30 15:00:00', '2019-12-30 16:00:00',
               '2019-12-30 17:00:00', '2019-12-30 18:00:00',
               '2019-12-30 19:00:00', '2019-12-30 20:00:00',
               '2019-12-30 21:00:00', '2019-12-30 22:00:00',
               '2019-12-30 23:00:00', '2019-12-31 00:00:00'],
              dtype='datetime64[ns]', length=179305, freq=None)

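I tried the procedure described here, but it loops for hours without producing a result (probably because of the large number of timestamps, 179305 in this case). The desired result would be: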

                                                    Band1       lat1       lon1
latitude level longitude time                 
41.0     1000  19.50    '1979-01-01 00:00:00'       23.0  41.015335  19.548331
                        '1979-01-01 01:00:00'       23.0  41.015335  19.548331
                        '1979-01-01 02:00:00'       23.0  41.015335  19.548331
                        '1979-01-01 03:00:00'       23.0  41.015335  19.548331
                        '1979-01-01 04:00:00'       23.0  41.015335  19.548331
                        ...                         ...     ...         ...

               19.60    '1979-01-01 00:00:00'       44.0  41.015335  19.565497
                        '1979-01-01 01:00:00'       44.0  41.015335  19.565497
                        '1979-01-01 02:00:00'       44.0  41.015335  19.565497
                        '1979-01-01 03:00:00'       44.0  41.015335  19.565497
                        '1979-01-01 04:00:00'       44.0  41.015335  19.565497
                        ...                         ....
               19.65                                12.0  41.015335  19.582663
               19.75                                35.0  41.015335  19.668494
               19.75                                83.0  41.015335  19.685660
                                                    ...        ...        ...
46.5     850   23.00                                1280.0  46.491333  23.015891
               23.00                                1390.0  46.491333  23.033057
               23.00                                1508.0  46.491333  23.050223
               23.00                                1519.0  46.491333  23.067389
               23.00                                1544.0  46.491333  23.084556

The main concern is speed, so a for loop is not an option. Any help is appreciated.

1 answer:

Answer 0: (score: 2)

You want the append option of set_index:

import numpy as np
import pandas as pd

# toy data
idx = pd.MultiIndex.from_arrays([list('aabbcc'), list('111111')], names=['x','y'])
df = pd.DataFrame(np.arange(18).reshape(-1,3),
                  index=idx,
                  columns=list('abc'))

times = [11,22]   # replace with your datetime series

# how many times `times` must repeat to cover every row
# (assumes len(df) is a multiple of len(times))
multi = len(df.index)//len(times)

# add the repeated values as a column, then append it as the innermost index level
df = (df.assign(time=np.tile(times, multi))
        .set_index('time', append=True)
     )

Output:

           a   b   c
x y time            
a 1 11     0   1   2
    22     3   4   5
b 1 11     6   7   8
    22     9  10  11
c 1 11    12  13  14
    22    15  16  17
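
The toy example above spreads the time values across rows that already exist. The desired output in the question instead appears to repeat every original row once per timestamp. A minimal sketch of that variant follows. It assumes a full cross product is wanted, and the small frame and the three timestamps are illustrative stand-ins, not the asker's data. Rows are repeated by position, so duplicate index entries such as the repeated 19.50 longitudes cause no trouble.

```python
import numpy as np
import pandas as pd

# illustrative stand-in for the question's frame (note the duplicate index entry)
idx = pd.MultiIndex.from_arrays(
    [[41.0, 41.0, 41.0], [1000, 1000, 1000], [19.50, 19.50, 19.75]],
    names=["latitude", "level", "longitude"],
)
df = pd.DataFrame({"Band1": [23.0, 44.0, 35.0]}, index=idx)

# illustrative stand-in for the 179305-element DatetimeIndex
times = pd.to_datetime(
    ["1979-01-01 00:00:00", "1979-01-01 01:00:00", "1979-01-01 02:00:00"]
)

n_rows, n_times = len(df), len(times)

# repeat each row n_times in a row, selecting by position rather than by label
expanded = df.iloc[np.repeat(np.arange(n_rows), n_times)]

# cycle through all timestamps once per original row,
# then append them as the innermost index level
expanded = (expanded.assign(time=np.tile(times.values, n_rows))
                    .set_index("time", append=True))

print(expanded)
```

The result has len(df) * len(times) rows, so with 179305 timestamps memory, rather than looping time, may become the limiting factor.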