Nested subtotal "All" rows with pandas pivot_table

Time: 2019-03-31 12:04:30

Tags: python pandas

I have some long-format data like this (see below for how to recreate it):

>>> df
   section subsection   name topic     score
0        A          W  zwphf     a  0.802427
1        A          W  jcyyc     a  0.404077
2        A          W  kucem     a  0.367319
3        A          X  ldbxz     a  0.554260
4        A          X  vkcqh     a  0.265864
5        A          X  cvksn     a  0.548099
6        B          Y  spghx     a  0.472612
7        B          Y  cqokn     a  0.577504
8        B          Y  wjsxg     a  0.815309
9        B          Z  holoo     a  0.459850
10       B          Z  lnihf     a  0.667877
11       B          Z  wirhq     a  0.138879
12       A          W  zwphf     b  0.673711
13       A          W  jcyyc     b  0.507962
14       A          W  kucem     b  0.546055
15       A          X  ldbxz     b  0.148214
16       A          X  vkcqh     b  0.773320
17       A          X  cvksn     b  0.791990
18       B          Y  spghx     b  0.487480
19       B          Y  cqokn     b  0.252534
20       B          Y  wjsxg     b  0.237767
21       B          Z  holoo     b  0.432981
22       B          Z  lnihf     b  0.317932
23       B          Z  wirhq     b  0.614401

I want to group on section + subsection + name + topic and unstack on topic, but also show interspersed, nested "All" subtotal rows:

>>> result                                                                                                                                         
   section subsection   name         a         b
0        A        All    All  0.490341  0.573542
1        A          W    All  0.524608  0.575909
2        A          W  jcyyc  0.404077  0.507962
3        A          W  kucem  0.367319  0.546055
4        A          W  zwphf  0.802427  0.673711
5        A          X    All  0.456074  0.571174
6        A          X  cvksn  0.548099  0.791990
7        A          X  ldbxz  0.554260  0.148214
8        A          X  vkcqh  0.265864  0.773320
9        B        All    All  0.522005  0.390516
10       B          Y    All  0.621808  0.325927
11       B          Y  cqokn  0.577504  0.252534
12       B          Y  spghx  0.472612  0.487480
13       B          Y  wjsxg  0.815309  0.237767
14       B          Z    All  0.422202  0.455104
15       B          Z  holoo  0.459850  0.432981
16       B          Z  lnihf  0.667877  0.317932
17       B          Z  wirhq  0.138879  0.614401

It is easier to see with the new rows highlighted:

[image: the expected result with the subtotal rows highlighted]

The initial groupby on its own, without the subtotals, is:

>>> df.groupby(['section', 'subsection', 'name', 'topic'])['score'].mean().unstack('topic')
topic                            a         b
section subsection name                     
A       W          jcyyc  0.404077  0.507962
                   kucem  0.367319  0.546055
                   zwphf  0.802427  0.673711
        X          cvksn  0.548099  0.791990
                   ldbxz  0.554260  0.148214
                   vkcqh  0.265864  0.773320
B       Y          cqokn  0.577504  0.252534
                   spghx  0.472612  0.487480
                   wjsxg  0.815309  0.237767
        Z          holoo  0.459850  0.432981
                   lnihf  0.667877  0.317932
                   wirhq  0.138879  0.614401

But I'm not sure how to use margins to get the subtotals, which are groupby ops on ['section', 'topic'] and ['section', 'subsection', 'topic'].
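For context, here is a minimal sketch of what margins=True produces on pivot_table. The small frame below is a made-up stand-in for df (its names and scores are illustrative assumptions, not the question's data). It shows that margins adds only one grand-total row and one grand-total column, not per-section or per-subsection subtotals.

```python
import pandas as pd

# Made-up stand-in for the question's df; values are for illustration only.
df = pd.DataFrame({
    'section':    ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'subsection': ['W', 'X', 'Y', 'Z', 'W', 'X', 'Y', 'Z'],
    'name':       ['n1', 'n2', 'n3', 'n4', 'n1', 'n2', 'n3', 'n4'],
    'topic':      ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'score':      [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
})

# margins=True appends a single "All" row and a single "All" column.
pt = df.pivot_table(index=['section', 'subsection', 'name'],
                    columns='topic', values='score',
                    aggfunc='mean', margins=True)
print(pt)
```

The nested subtotals therefore have to be built separately, which is what the answers below do.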


To recreate df:

import pandas as pd
data = [['A', 'W', 'zwphf', 'a', 0.80242702],
        ['A', 'W', 'jcyyc', 'a', 0.40407741],
        ['A', 'W', 'kucem', 'a', 0.36731944],
        ['A', 'X', 'ldbxz', 'a', 0.55426007],
        ['A', 'X', 'vkcqh', 'a', 0.26586396],
        ['A', 'X', 'cvksn', 'a', 0.54809939],
        ['B', 'Y', 'spghx', 'a', 0.47261223],
        ['B', 'Y', 'cqokn', 'a', 0.57750357],
        ['B', 'Y', 'wjsxg', 'a', 0.81530899],
        ['B', 'Z', 'holoo', 'a', 0.45985020],
        ['B', 'Z', 'lnihf', 'a', 0.66787651],
        ['B', 'Z', 'wirhq', 'a', 0.13887864],
        ['A', 'W', 'zwphf', 'b', 0.67371101],
        ['A', 'W', 'jcyyc', 'b', 0.50796174],
        ['A', 'W', 'kucem', 'b', 0.54605544],
        ['A', 'X', 'ldbxz', 'b', 0.14821402],
        ['A', 'X', 'vkcqh', 'b', 0.77331968],
        ['A', 'X', 'cvksn', 'b', 0.79198960],
        ['B', 'Y', 'spghx', 'b', 0.48747995],
        ['B', 'Y', 'cqokn', 'b', 0.25253355],
        ['B', 'Y', 'wjsxg', 'b', 0.23776694],
        ['B', 'Z', 'holoo', 'b', 0.43298050],
        ['B', 'Z', 'lnihf', 'b', 0.31793156],
        ['B', 'Z', 'wirhq', 'b', 0.61440056]]
df = pd.DataFrame(data,
                  columns=['section', 'subsection', 'name', 'topic', 'score'])

To recreate the expected result:

import numpy as np

result = np.array([['A', 'All', 'All', 0.490341219, 0.573541919],
                   ['A', 'W', 'All', 0.52460796, 0.5759094],
                   ['A', 'W', 'jcyyc', 0.404077415, 0.5079617479999999],
                   ['A', 'W', 'kucem', 0.36731944, 0.546055442],
                   ['A', 'W', 'zwphf', 0.8024270240000001, 0.673711011],
                   ['A', 'X', 'All', 0.45607447700000003, 0.571174437],
                   ['A', 'X', 'cvksn', 0.548099391, 0.791989603],
                   ['A', 'X', 'ldbxz', 0.554260074, 0.148214029],
                   ['A', 'X', 'vkcqh', 0.265863967, 0.77331968],
                   ['B', 'All', 'All', 0.5220050279999999, 0.390515513],
                   ['B', 'Y', 'All', 0.621808268, 0.325926816],
                   ['B', 'Y', 'cqokn', 0.577503576, 0.252533557],
                   ['B', 'Y', 'spghx', 0.472612233, 0.487479951],
                   ['B', 'Y', 'wjsxg', 0.815308995, 0.237766941],
                   ['B', 'Z', 'All', 0.42220178799999997, 0.455104209],
                   ['B', 'Z', 'holoo', 0.459850205, 0.43298050200000004],
                   ['B', 'Z', 'lnihf', 0.667876511, 0.317931565],
                   ['B', 'Z', 'wirhq', 0.13887864800000002, 0.61440056]], dtype=object)
result = pd.DataFrame(result, columns=['section', 'subsection', 'name', 'a', 'b'])

2 answers:

Answer 0 (score: 2)

You need:

s = df.groupby(['section', 'subsection', 'name', 'topic'])['score'].mean().unstack('topic')

# mean(level=...) was removed in pandas 2.0; groupby(level=...).mean() is the equivalent
s1 = (s.groupby(level=0).mean()
       .assign(subsection='All', name='All')
       .set_index(['subsection', 'name'], append=True))
s2 = (s.groupby(level=[0, 1]).mean()
       .assign(name='All')
       .set_index(['name'], append=True))

s = pd.concat([s, s1, s2]).sort_index()

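But I am not sure the solution above is correct if you need sub-means, because it takes a mean of means, which equals the mean over the raw rows only when all groups are the same size. Better is to compute the subtotals from df directly:

# recompute s so that this block does not depend on the concat above
s = df.groupby(['section', 'subsection', 'name', 'topic'])['score'].mean().unstack('topic')
s1 = (df.groupby(['section', 'topic'])['score'].mean()
        .unstack('topic')
        .assign(subsection='All', name='All')
        .set_index(['subsection', 'name'], append=True))
s2 = (df.groupby(['section', 'subsection', 'topic'])['score'].mean()
        .unstack('topic')
        .assign(name='All')
        .set_index(['name'], append=True))

s = pd.concat([s, s1, s2]).sort_index()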
print (s)
topic                            a         b
section subsection name                     
A       All        All    0.490341  0.573542
        W          All    0.524608  0.575909
                   jcyyc  0.404077  0.507962
                   kucem  0.367319  0.546055
                   zwphf  0.802427  0.673711
        X          All    0.456074  0.571174
                   cvksn  0.548099  0.791990
                   ldbxz  0.554260  0.148214
                   vkcqh  0.265864  0.773320
B       All        All    0.522005  0.390516
        Y          All    0.621808  0.325927
                   cqokn  0.577504  0.252534
                   spghx  0.472612  0.487480
                   wjsxg  0.815309  0.237767
        Z          All    0.422202  0.455104
                   holoo  0.459850  0.432980
                   lnihf  0.667877  0.317932
                   wirhq  0.138879  0.614401
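To illustrate the mean-of-means caveat mentioned above, here is a small sketch on a made-up frame (toy and its values are assumptions for illustration, not the question's data). The subsections have different sizes, so the two approaches disagree.

```python
import pandas as pd

# Made-up data: subsection W has two rows, subsection X has one.
toy = pd.DataFrame({
    'section':    ['A', 'A', 'A'],
    'subsection': ['W', 'W', 'X'],
    'score':      [1.0, 2.0, 6.0],
})

# Mean of the per-subsection means: (1.5 + 6.0) / 2
per_sub = toy.groupby(['section', 'subsection'])['score'].mean()
mean_of_means = per_sub.groupby(level='section').mean()

# Mean over all raw rows of the section: (1 + 2 + 6) / 3
overall = toy.groupby('section')['score'].mean()

print(mean_of_means['A'], overall['A'])  # 3.75 3.0
```

In the question's data every subsection has three names and every name has one score per topic, which is why both variants print the same numbers there.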

Edit:

If the ordering matters, it is possible to use ordered categoricals (here with tot instead of All):

cat1 = ['tot'] + df['subsection'].unique().tolist()
cat2 = ['tot'] + df['name'].unique().tolist()

df['subsection'] = pd.Categorical(df['subsection'], categories=cat1, ordered=True)
df['name'] = pd.Categorical(df['name'], categories=cat2, ordered=True)

# observed=True keeps only the category combinations that actually occur in the data
s = (df.groupby(['section', 'subsection', 'name', 'topic'], observed=True)['score'].mean()
       .unstack('topic'))
s1 = (df.groupby(['section','topic'])['score'].mean()
        .unstack('topic').assign(subsection = 'tot', name='tot')
        .set_index(['subsection','name'], append=True))

s2 = (df.groupby(['section','subsection','topic'], observed=True)['score'].mean()
        .unstack('topic')
        .assign(name='tot')
        .set_index(['name'], append=True))

s = pd.concat([s, s1, s2]).sort_index()

print (s)
topic                            a         b
section subsection name                     
A       tot        tot    0.490341  0.573542
        W          tot    0.524608  0.575909
                   zwphf  0.802427  0.673711
                   jcyyc  0.404077  0.507962
                   kucem  0.367319  0.546055
        X          tot    0.456074  0.571174
                   ldbxz  0.554260  0.148214
                   vkcqh  0.265864  0.773320
                   cvksn  0.548099  0.791990
B       tot        tot    0.522005  0.390516
        Y          tot    0.621808  0.325927
                   spghx  0.472612  0.487480
                   cqokn  0.577504  0.252534
                   wjsxg  0.815309  0.237767
        Z          tot    0.422202  0.455104
                   holoo  0.459850  0.432980
                   lnihf  0.667877  0.317932
                   wirhq  0.138879  0.614401
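The pattern above (one groupby per prefix of the key columns, then concat and sort) can be wrapped in a loop for any number of levels. The following is only a sketch: pivot_with_subtotals and the toy frame are hypothetical names and data introduced for illustration, and the final sort relies on the label 'All' sorting ahead of the other labels in plain string order.

```python
import pandas as pd

def pivot_with_subtotals(df, keys, column, value, label='All'):
    """Hypothetical helper: mean of `value` per `column`, with nested subtotal rows."""
    keys = list(keys)
    parts = []
    for depth in range(1, len(keys) + 1):
        # Group on the first `depth` keys; the deeper keys are collapsed into `label`.
        part = (df.groupby(keys[:depth] + [column])[value].mean()
                  .unstack(column)
                  .reset_index())
        for collapsed in keys[depth:]:
            part[collapsed] = label
        parts.append(part)
    out = pd.concat(parts, ignore_index=True)
    value_cols = [c for c in out.columns if c not in keys]
    out = out.sort_values(keys).reset_index(drop=True)[keys + value_cols]
    out.columns.name = None
    return out

# Made-up data for illustration.
toy = pd.DataFrame(
    [['A', 'W', 'n1', 'a', 1.0], ['A', 'W', 'n2', 'a', 3.0], ['A', 'X', 'n3', 'a', 5.0],
     ['A', 'W', 'n1', 'b', 2.0], ['A', 'W', 'n2', 'b', 4.0], ['A', 'X', 'n3', 'b', 6.0]],
    columns=['section', 'subsection', 'name', 'topic', 'score'])

out = pivot_with_subtotals(toy, ['section', 'subsection', 'name'], 'topic', 'score')
print(out)
```

Each subtotal is computed from the raw rows, so it is a true mean of the group and not a mean of means.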

Answer 1 (score: 1)

You can do the following: first we create three groupby dataframes, then we concatenate them and fill the resulting NaN values with All.

df1 = df.groupby(['section', 'subsection', 'name', 'topic']).score.mean().unstack('topic').reset_index()

df2 = df.groupby(['section', 'subsection', 'topic']).score.mean().unstack('topic').reset_index()

df3 = df.groupby(['section', 'topic']).score.mean().unstack('topic').reset_index()

df_final = pd.concat([df1,df2,df3],axis=0, ignore_index=True, sort=True) \
           .fillna('All').sort_values(['section', 'subsection','name']) \
           .reset_index(drop=True)

print(df_final)
           a         b   name section subsection
0   0.490341  0.573542    All       A        All
1   0.524608  0.575909    All       A          W
2   0.404077  0.507962  jcyyc       A          W
3   0.367319  0.546055  kucem       A          W
4   0.802427  0.673711  zwphf       A          W
5   0.456074  0.571174    All       A          X
6   0.548099  0.791990  cvksn       A          X
7   0.554260  0.148214  ldbxz       A          X
8   0.265864  0.773320  vkcqh       A          X
9   0.522005  0.390516    All       B        All
10  0.621808  0.325927    All       B          Y
11  0.577504  0.252534  cqokn       B          Y
12  0.472612  0.487480  spghx       B          Y
13  0.815309  0.237767  wjsxg       B          Y
14  0.422202  0.455104    All       B          Z
15  0.459850  0.432980  holoo       B          Z
16  0.667877  0.317932  lnihf       B          Z
17  0.138879  0.614401  wirhq       B          Z

Reorder the columns to get the correct output:

cols = ['section', 'subsection', 'name', 'a', 'b']

df_final = df_final[cols]
print(df_final)
   section subsection   name         a         b
0        A        All    All  0.490341  0.573542
1        A          W    All  0.524608  0.575909
2        A          W  jcyyc  0.404077  0.507962
3        A          W  kucem  0.367319  0.546055
4        A          W  zwphf  0.802427  0.673711
5        A          X    All  0.456074  0.571174
6        A          X  cvksn  0.548099  0.791990
7        A          X  ldbxz  0.554260  0.148214
8        A          X  vkcqh  0.265864  0.773320
9        B        All    All  0.522005  0.390516
10       B          Y    All  0.621808  0.325927
11       B          Y  cqokn  0.577504  0.252534
12       B          Y  spghx  0.472612  0.487480
13       B          Y  wjsxg  0.815309  0.237767
14       B          Z    All  0.422202  0.455104
15       B          Z  holoo  0.459850  0.432980
16       B          Z  lnihf  0.667877  0.317932
17       B          Z  wirhq  0.138879  0.614401

Note that I call fillna before sort_values, so that the sort gives the correct output.
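A small sketch of why the order of fillna and sort_values matters, using a made-up frame (toy and its values are illustrative assumptions): sort_values places NaN last by default, so sorting before filling pushes the subtotal rows to the bottom of each group.

```python
import numpy as np
import pandas as pd

# Made-up rows: one detail row, one section subtotal, one subsection subtotal.
toy = pd.DataFrame({
    'section':    ['A', 'A', 'A'],
    'subsection': ['W', np.nan, 'W'],
    'name':       ['n1', np.nan, np.nan],
})
keys = ['section', 'subsection', 'name']

# Sorting first: NaN goes last, so subtotal rows end up after the detail rows.
sort_then_fill = toy.sort_values(keys).fillna('All').reset_index(drop=True)

# Filling first: 'All' sorts ahead of the other labels used here.
fill_then_sort = toy.fillna('All').sort_values(keys).reset_index(drop=True)

# Alternative: keep the sort first but ask for NaN to be placed first.
na_first = (toy.sort_values(keys, na_position='first')
               .fillna('All').reset_index(drop=True))

print(sort_then_fill['name'].tolist())  # ['n1', 'All', 'All']
print(fill_then_sort['name'].tolist())  # ['All', 'All', 'n1']
```

The na_position='first' variant does not depend on how the label 'All' compares with the real labels, which may be useful if some labels would otherwise sort before it.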