Pandas: Applying an operation across a MultiIndex

Time: 2016-05-27 19:08:51

Tags: python pandas dataframe

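I have MultiIndex columns: the second level repeatedly contains Job openings and Hires. I want to subtract one from the other for each top-level column, but everything I try runs into an indexing error or a slicing error. How can I compute this?

Sample data: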

>>> df.head()
Out[25]: 
           Total nonfarm              Total private               
                   Hires Job openings         Hires Job openings   
date                                                               
2001-01-01          5777         5385          5419         4887   
2002-01-01          4849         3759          4539         3381   
2003-01-01          4971         3824          4645         3424   
2004-01-01          4827         3459          4552         3153   
2005-01-01          5207         3670          4876         3358  

Expected output:

Out[25]: 
           Total nonfarm   Total private              
              difference      difference   
date                                                               
2001-01-01          1234            5678          
2002-01-01          1234            5678          
2003-01-01          1234            5678         
2004-01-01          1234            5678      
2005-01-01          1234            5678    

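where the numbers are obviously placeholders, not the real differences.

Regarding apply() specifically

To get a generally applicable approach, I tried setting up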

def apply(group):
    result = group.loc[:, pd.IndexSlice[:, 'Job openings']].div(group.loc[:, pd.IndexSlice[:, 'Hires']].values)
    result.columns = pd.MultiIndex.from_product([[group.columns.get_level_values(0)[0]], ['Ratio']])
    return result.values
foo = df.groupby(axis=1, level=0).apply(apply)

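This has two problems:

  • I have to cheat with .values to get the division to work
  • foo is not a proper DataFrame:

    Accommodation and food services        [[0.76], [0.480349344978], [0.501388888889], [...
    Arts, entertainment, and recreation    [[0.558139534884], [0.46017699115], [0.2483221 ...
    Construction                           [[0.35], [0.274881516588], [0.267260579065], [...

I first tried returning result instead of result.values, but that only produced a DataFrame full of NaN.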
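The all-NaN frame is consistent with pandas aligning on column labels before it divides: the 'Job openings' slice and the 'Hires' slice share no column label, so no pair of columns ever meets. A minimal sketch of that effect, assuming a two-row subset of the sample data (the variable names are illustrative only):

```python
import pandas as pd

# Two-row subset of the sample data, same column layout as the question
df = pd.DataFrame(
    [[5777, 5385, 5419, 4887], [4849, 3759, 4539, 3381]],
    index=pd.to_datetime(["2001-01-01", "2002-01-01"]),
    columns=pd.MultiIndex.from_product(
        [["Total nonfarm", "Total private"], ["Hires", "Job openings"]]
    ),
)

idx = pd.IndexSlice
openings = df.loc[:, idx[:, "Job openings"]]
hires = df.loc[:, idx[:, "Hires"]]

# The second-level labels differ, so alignment finds no matching columns
aligned = openings.div(hires)
all_nan = bool(aligned.isna().all().all())

# Dropping the second level leaves identical top-level labels on both sides,
# so the division lines up without resorting to .values
ratio = openings.droplevel(1, axis=1).div(hires.droplevel(1, axis=1))
```

Here all_nan is True, while ratio holds one column per top-level label.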

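Regarding the use of column names specifically

As for the top-voted answer, I don't like that it relies on the .diff() / .div() hack, which makes the code hard to read and is hard to extend to cases with more than two columns at the sub-level.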

5 answers:

Answer 0 (score: 3)

Setup

import pandas as pd

df = pd.DataFrame(
    [
        [5777, 5385, 5419, 4887],
        [4849, 3759, 4539, 3381],
        [4971, 3824, 4645, 3424],
        [4827, 3459, 4552, 3153],
        [5207, 3670, 4876, 3358],
    ],
    index=pd.to_datetime(['2001-01-01',
                          '2002-01-01',
                          '2003-01-01',
                          '2004-01-01',
                          '2005-01-01']),
    columns=pd.MultiIndex.from_tuples(
        [('Total nonfarm', 'Hires'), ('Total nonfarm', 'Job Openings'),
         ('Total private', 'Hires'), ('Total private', 'Job Openings')]
    )
)

print(df)

           Total nonfarm              Total private             
                   Hires Job Openings         Hires Job Openings
2001-01-01          5777         5385          5419         4887
2002-01-01          4849         3759          4539         3381
2003-01-01          4971         3824          4645         3424
2004-01-01          4827         3459          4552         3153
2005-01-01          5207         3670          4876         3358

Try:

df.T.groupby(level=0).diff(-1).dropna().T

           Total nonfarm Total private
                   Hires         Hires
2001-01-01         392.0         532.0
2002-01-01        1090.0        1158.0
2003-01-01        1147.0        1221.0
2004-01-01        1368.0        1399.0
2005-01-01        1537.0        1518.0

To apply a different transformation, such as a ratio, you could do:

import numpy as np  # needed for np.exp and np.log
print(df.T.groupby(level=0).apply(lambda x: np.exp(np.log(x).diff(-1))).dropna().T)

           Total nonfarm Total private
                   Hires         Hires
2001-01-01      1.072795      1.108860
2002-01-01      1.289971      1.342502
2003-01-01      1.299948      1.356600
2004-01-01      1.395490      1.443704
2005-01-01      1.418801      1.452055
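The np.exp(np.log(x).diff(-1)) form works because exp(log a - log b) equals a / b, so a difference taken in log space is a ratio in the original space. A quick check of that identity, assuming the first two 'Total nonfarm' rows from the setup (the array names are illustrative):

```python
import numpy as np

# First two rows of ('Total nonfarm', 'Hires') and ('Total nonfarm', 'Job Openings')
hires = np.array([5777.0, 4849.0])
openings = np.array([5385.0, 3759.0])

# Difference of logs, then exponentiate ...
via_logs = np.exp(np.log(hires) - np.log(openings))

# ... which should agree with plain division
direct = hires / openings
same = bool(np.allclose(via_logs, direct))
```

The detour through logarithms only makes sense for strictly positive values; plain division avoids that restriction.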

Or:

print(df.T.groupby(level=0).apply(lambda x: x.div(x.shift(-1))).dropna().T)

           Total nonfarm Total private
                   Hires         Hires
2001-01-01      1.072795      1.108860
2002-01-01      1.289971      1.342502
2003-01-01      1.299948      1.356600
2004-01-01      1.395490      1.443704
2005-01-01      1.418801      1.452055

To rename the columns and merge the result with the original DataFrame, you can:

df2 = df.T.groupby(level=0).diff(-1).dropna().T
df2.columns = pd.MultiIndex.from_tuples(
    [('Total nonfarm', 'difference'),
     ('Total private', 'difference')])
pd.concat([df, df2], axis=1).sort_index(axis=1)

Which looks like:

           Total nonfarm                         Total private               \
                   Hires Job Openings difference         Hires Job Openings   
2001-01-01          5777         5385      392.0          5419         4887   
2002-01-01          4849         3759     1090.0          4539         3381   
2003-01-01          4971         3824     1147.0          4645         3424   
2004-01-01          4827         3459     1368.0          4552         3153   
2005-01-01          5207         3670     1537.0          4876         3358   

           difference  
2001-01-01      532.0  
2002-01-01     1158.0  
2003-01-01     1221.0  
2004-01-01     1399.0  
2005-01-01     1518.0  

Answer 1 (score: 2)

I think you can use IndexSlice:

idx = pd.IndexSlice
df[('Total private','difference')] = (df.loc[:, idx[('Total nonfarm', 'Hires')]] - 
                                      df.loc[:, idx[('Total private', 'Hires')]])
print (df)
           Total nonfarm              Total private                        
date               Hires Job openings         Hires Job openings difference
2001-01-01          5777         5385          5419         4887        358
2002-01-01          4849         3759          4539         3381        310
2003-01-01          4971         3824          4645         3424        326
2004-01-01          4827         3459          4552         3153        275
2005-01-01          5207         3670          4876         3358        331
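The snippet above subtracts the Hires columns of the two top-level groups from each other. If the goal is instead Hires minus Job openings within each group, as in the question's expected output, the same full-tuple column access can be repeated per top-level label. A rough sketch, assuming the question's column labels (new_cols and out are illustrative names):

```python
import pandas as pd

df = pd.DataFrame(
    [
        [5777, 5385, 5419, 4887],
        [4849, 3759, 4539, 3381],
        [4971, 3824, 4645, 3424],
        [4827, 3459, 4552, 3153],
        [5207, 3670, 4876, 3358],
    ],
    index=pd.to_datetime(
        ["2001-01-01", "2002-01-01", "2003-01-01", "2004-01-01", "2005-01-01"]
    ),
    columns=pd.MultiIndex.from_product(
        [["Total nonfarm", "Total private"], ["Hires", "Job openings"]]
    ),
)

# One new (top, 'difference') column per top-level label
new_cols = {
    (top, "difference"): df[(top, "Hires")] - df[(top, "Job openings")]
    for top in df.columns.get_level_values(0).unique()
}

# Tuple keys become MultiIndex columns; sorting groups them under their parent
out = pd.concat([df, pd.DataFrame(new_cols)], axis=1).sort_index(axis=1)
```

Collecting the new columns first and concatenating once keeps the original frame untouched and avoids inserting into an unsorted column index inside the loop.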

If you want multiple columns, you can use a modified version of piRSquared's answer; the transposes can be dropped:

print(df.groupby(level=0, axis=1).diff(-1).dropna(axis=1))
           Total nonfarm Total private             
date               Hires         Hires Job openings
2001-01-01         392.0         532.0       4495.0
2002-01-01        1090.0        1158.0       2291.0
2003-01-01        1147.0        1221.0       2277.0
2004-01-01        1368.0        1399.0       1785.0
2005-01-01        1537.0        1518.0       1821.0

Answer 2 (score: 1)

Let's keep it simple.

In [19]: df['Total nonfarm'] - df['Total private']
Out[19]: 
            Hires  Job Openings
2001-01-01    358           498
2002-01-01    310           378
2003-01-01    326           400
2004-01-01    275           306
2005-01-01    331           312

Answer 3 (score: 1)

Another approach is to swap the column levels and use the column accessors.

Setup

import pandas as pd

df = pd.DataFrame(
    [
        [5777, 5385, 5419, 4887],
        [4849, 3759, 4539, 3381],
        [4971, 3824, 4645, 3424],
        [4827, 3459, 4552, 3153],
        [5207, 3670, 4876, 3358],
    ],
    index=pd.to_datetime(['2001-01-01',
                          '2002-01-01',
                          '2003-01-01',
                          '2004-01-01',
                          '2005-01-01']),
    columns=pd.MultiIndex.from_tuples(
        [('Total nonfarm', 'Hires'), ('Total nonfarm', 'Job Openings'),
         ('Total private', 'Hires'), ('Total private', 'Job Openings')]
    )
)

print(df)
           Total nonfarm              Total private             
                   Hires Job Openings         Hires Job Openings
2001-01-01          5777         5385          5419         4887
2002-01-01          4849         3759          4539         3381
2003-01-01          4971         3824          4645         3424
2004-01-01          4827         3459          4552         3153
2005-01-01          5207         3670          4876         3358

If we swap the levels and then sort, it looks like:

print(df.swaplevel(0, 1, axis=1).sort_index(axis=1))

                   Hires                Job Openings              
           Total nonfarm Total private Total nonfarm Total private
2001-01-01          5777          5419          5385          4887
2002-01-01          4849          4539          3759          3381
2003-01-01          4971          4645          3824          3424
2004-01-01          4827          4552          3459          3153
2005-01-01          5207          4876          3670          3358

With this, we can access the hires via .Hires or ['Hires']. Combining this with your need for subtraction:

print(df.swaplevel(0, 1, axis=1)['Hires'])

            Total nonfarm  Total private
2001-01-01           5777           5419
2002-01-01           4849           4539
2003-01-01           4971           4645
2004-01-01           4827           4552
2005-01-01           5207           4876

print(df.swaplevel(0, 1, axis=1)['Hires'] - df.swaplevel(0, 1, axis=1)['Job Openings'])

            Total nonfarm  Total private
2001-01-01            392            532
2002-01-01           1090           1158
2003-01-01           1147           1221
2004-01-01           1368           1399
2005-01-01           1537           1518

Solution

Putting it all together, I did:

df_ = df.swaplevel(0, 1, 1)

_df = pd.concat([
        df_,
        pd.concat([df_['Hires'] - df_['Job Openings'], df_['Hires'] / df_['Job Openings']],
                 axis=1, keys=['Difference', 'Ratio'])
    ], axis=1)

df = _df.swaplevel(0, 1, 1).sort_index(axis=1)

print(df)

           Total nonfarm                              Total private        \
              Difference Hires Job Openings     Ratio    Difference Hires   
2001-01-01           392  5777         5385  1.072795           532  5419   
2002-01-01          1090  4849         3759  1.289971          1158  4539   
2003-01-01          1147  4971         3824  1.299948          1221  4645   
2004-01-01          1368  4827         3459  1.395490          1399  4552   
2005-01-01          1537  5207         3670  1.418801          1518  4876   


           Job Openings     Ratio  
2001-01-01         4887  1.108860  
2002-01-01         3381  1.342502  
2003-01-01         3424  1.356600  
2004-01-01         3153  1.443704  
2005-01-01         3358  1.452055 

You can also use xs to grab a cross-section.

kw = dict(axis=1, level=1)

df.xs('Hires', **kw) - df.xs('Job Openings', **kw)

            Total nonfarm  Total private
2001-01-01            392            532
2002-01-01           1090           1158
2003-01-01           1147           1221
2004-01-01           1368           1399
2005-01-01           1537           1518
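The xs result has a flat column index. To get the two-level header shown in the question's expected output, one option is to wrap the result with pd.concat and a dict key, then swap the levels. A sketch under that assumption (the 'difference' label and the names diff and labelled are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    [
        [5777, 5385, 5419, 4887],
        [4849, 3759, 4539, 3381],
        [4971, 3824, 4645, 3424],
        [4827, 3459, 4552, 3153],
        [5207, 3670, 4876, 3358],
    ],
    index=pd.to_datetime(
        ["2001-01-01", "2002-01-01", "2003-01-01", "2004-01-01", "2005-01-01"]
    ),
    columns=pd.MultiIndex.from_tuples(
        [("Total nonfarm", "Hires"), ("Total nonfarm", "Job Openings"),
         ("Total private", "Hires"), ("Total private", "Job Openings")]
    ),
)

kw = dict(axis=1, level=1)
diff = df.xs("Hires", **kw) - df.xs("Job Openings", **kw)

# The dict key becomes the outer column level; swap so the group name is on top
labelled = pd.concat({"difference": diff}, axis=1).swaplevel(0, 1, axis=1)
```

labelled can then be concatenated with df along axis=1 and sorted, as in the solution above.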

Answer 4 (score: 1)

Using groupby and apply

Setup

import pandas as pd

df = pd.DataFrame(
    [
        [5777, 5385, 5419, 4887],
        [4849, 3759, 4539, 3381],
        [4971, 3824, 4645, 3424],
        [4827, 3459, 4552, 3153],
        [5207, 3670, 4876, 3358],
    ],
    index=pd.to_datetime(['2001-01-01',
                          '2002-01-01',
                          '2003-01-01',
                          '2004-01-01',
                          '2005-01-01']),
    columns=pd.MultiIndex.from_tuples(
        [('Total nonfarm', 'Hires'), ('Total nonfarm', 'Job Openings'),
         ('Total private', 'Hires'), ('Total private', 'Job Openings')]
    )
)

print(df)

Solution

def diff(group):
    g = group.shift().sub(group).dropna()
    g.index = ['Difference']
    return g

def ratio(group):
    g = group.shift().div(group).dropna()
    g.index = ['Ratio']
    return g

def do_nothing(group):
    return group

pd.concat(
    [df.T.groupby(level=0).apply(f).T for f in [diff, ratio, do_nothing]],
    axis=1
).sort_index(axis=1)

           Total nonfarm                          Total private        \
              Difference Hires Job Openings Ratio    Difference Hires   
2001-01-01         392.0  5777         5385  1.07         532.0  5419   
2002-01-01        1090.0  4849         3759  1.29        1158.0  4539   
2003-01-01        1147.0  4971         3824  1.30        1221.0  4645   
2004-01-01        1368.0  4827         3459  1.40        1399.0  4552   
2005-01-01        1537.0  5207         3670  1.42        1518.0  4876   


           Job Openings Ratio  
2001-01-01         4887  1.11  
2002-01-01         3381  1.34  
2003-01-01         3424  1.36  
2004-01-01         3153  1.44  
2005-01-01         3358  1.45