Question

I have a pandas dataframe that looks like this.

set language    group   version metric_1    metric_2    metric_3
X   English     1       A       100         20          5
X   French      2       A       90          10          10
X   English     1       B       80          30          15
X   French      2       B       70          20          20
Y   English     1       A       200         20          30
Y   French      2       A       180         30          20
Y   English     1       B       160         10          10
Y   French      2       B       140         20          5
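
For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: the dtypes are an assumption (group as integers, the other key columns as strings), since the post shows the data as text only.

```python
import pandas as pd

# Sample rows copied from the table above; dtypes are assumed.
df = pd.DataFrame({
    "set":      ["X", "X", "X", "X", "Y", "Y", "Y", "Y"],
    "language": ["English", "French", "English", "French"] * 2,
    "group":    [1, 2, 1, 2] * 2,
    "version":  ["A", "A", "B", "B"] * 2,
    "metric_1": [100, 90, 80, 70, 200, 180, 160, 140],
    "metric_2": [20, 10, 30, 20, 20, 30, 10, 20],
    "metric_3": [5, 10, 15, 20, 30, 20, 10, 5],
})
print(df)
```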

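I want to summarize the metrics for all combinations of the experiment attributes: set, language, group, and version. The summary dataframe would look like this.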

set language    group   version metric_1    metric_2    metric_3
X                               800         140         80
Y                               1000        140         80
    English                     1200        200         80
    French                      600         80          80
                1               1050        120         60
                2               750         160         100
                        A       850         140         80
                        B       950         140         80
X   English                     500         100         40
X   French                      300         40          40
Y   English                     700         100         40
Y   French                      300         40          40
X               1               350         60          30
X               2               450         80          50
Y               1               700         60          30
Y               2               300         80          50
X                       A       350         70          40
X                       B       450         70          40
Y                       A       500         70          40
Y                       B       500         70          40
    English     1               ...
    English     2               ...
    French      1               ...
    French      2               ...
    English             A       ...
    English             B       ...
    French              A       ...
    French              B       ...
                1       A       ...
                1       B       ...
                2       A       ...
                2       B       ...
X   English     1               ...
X   English     2               ...
X   French      1               ...
X   French      2               ...
Y   English     1               ...
Y   English     2               ...
Y   French      1               ...
Y   French      2               ...
X   English             A       ...
X   English             B       ...
X   French              A       ...
X   French              B       ...
Y   English             A       ...
Y   English             B       ...
Y   French              A       ...
Y   French              B       ...
X               1       A       ...
X               1       B       ...
X               2       A       ...
X               2       B       ...
Y               1       A       ...
Y               1       B       ...
Y               2       A       ...
Y               2       B       ...
    English     1       A       ...
    English     1       B       ...
    English     2       A       ...
    English     2       B       ...
    French      1       A       ...
    French      1       B       ...
    French      2       A       ...
    French      2       B       ...

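I know I could brute-force this by running groupby on each combination and concatenating all of the results into a single dataframe. The number of attributes may grow, though, so I am looking for a more scalable solution. I have been reading about the functions that itertools provides, but I am not sure how they would apply here.

Any ideas or pointers would be appreciated. Thanks!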

Answer 1

In fact, the combinations function from itertools can help you create all the combinations. Let's assume your data is in a dataframe named df.

import pandas as pd
from itertools import combinations

# build two lists: the metric columns to sum, and the remaining key columns
list_metric = [col for col in df.columns if 'metric' in col]
list_non_metric = [col for col in df.columns if 'metric' not in col]
# aggregate once on all key columns
df_grouped = df.groupby(list_non_metric, as_index=False)[list_metric].sum()
# group by every combination of 1..n key columns and concatenate the results;
# key columns missing from a grouping come out as NaN and are blanked by fillna
df_output = (pd.concat([df_grouped.groupby(list(combi), as_index=False)[list_metric].sum()
                        for j in range(1, len(list_non_metric) + 1)
                        for combi in combinations(list_non_metric, j)])
               .fillna(''))
# reorder the columns to match the input data (if necessary)
df_output = df_output[df.columns]

To see how combinations works, try printing this:

[combi for combi in combinations(list_non_metric,2)]
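
As a standalone illustration, here is what that call yields for the four key columns in the question. The list is written out by hand here rather than derived from df, which is an assumption made so the snippet runs on its own.

```python
from itertools import combinations

# Key columns from the question, hard-coded so the snippet is self-contained.
list_non_metric = ["set", "language", "group", "version"]

pairs = list(combinations(list_non_metric, 2))
print(pairs)
# [('set', 'language'), ('set', 'group'), ('set', 'version'),
#  ('language', 'group'), ('language', 'version'), ('group', 'version')]
```

Each tuple keeps the order of the input list, and no column is paired with itself.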

The outer loop, for j in range(1, len(list_non_metric)+1), then builds the combinations of 1, 2, 3, ... elements of list_non_metric.
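
The same pattern can be wrapped in a function so it scales with the number of key columns: n key columns give 2^n - 1 non-empty groupings (15 for the four in the question). The sketch below is one possible packaging, not part of the original answer. The name summarize_combinations and the toy frame are hypothetical, and it leaves missing keys as NaN so the caller can decide whether to blank them.

```python
from itertools import combinations

import pandas as pd


def summarize_combinations(df, keys, metrics):
    """Hypothetical helper: sum `metrics` over every non-empty subset of `keys`."""
    parts = [
        df.groupby(list(combo), as_index=False)[metrics].sum()
        for size in range(1, len(keys) + 1)
        for combo in combinations(keys, size)
    ]
    # Key columns absent from a grouping come out as NaN after concat.
    return pd.concat(parts, ignore_index=True, sort=False)[keys + metrics]


# Small made-up frame, only to exercise the helper.
toy = pd.DataFrame({
    "set": ["X", "X", "Y"],
    "version": ["A", "B", "A"],
    "metric_1": [1, 2, 4],
})
summary = summarize_combinations(toy, ["set", "version"], ["metric_1"])
print(summary.fillna(""))
```

With two key columns there are three groupings (set, version, and set plus version), so the toy result has 2 + 2 + 3 = 7 rows.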

Answer 2

Here is one way to do it. I assume you only provided part of your data, since the totals don't add up:

In []:
import itertools as it
import pandas as pd

cols = df.columns.tolist()
index = ['set', 'language', 'group', 'version']
df = df.set_index(index)
# group by every combination of index level positions, then restore the columns
pd.concat([df.groupby(level=list(x)).sum().reset_index()
           for n in range(1, len(index)+1)
           for x in it.combinations(range(len(index)), n)],
          sort=True)[cols].fillna('')

Out[]:
   set language group version  metric_1  metric_2  metric_3
0    X                              340        80        50
1    Y                              680        80        65
0       English                     540        80        60
1        French                     480        80        55
0                   1               540        80        60
1                   2               480        80        55
0                           A       570        80        65
1                           B       450        80        50
0    X  English                     180        50        20
1    X   French                     160        30        30
2    Y  English                     360        30        40
3    Y   French                     320        50        25
...
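
One way to sanity-check output like this: every grouping partitions the same rows, so the metric totals of each block should equal the grand total. The sketch below assumes the eight sample rows from the question (retyped here so it runs on its own) and checks metric_1 only.

```python
import itertools as it

import pandas as pd

index = ["set", "language", "group", "version"]
# The eight sample rows from the question, retyped by hand.
df = pd.DataFrame({
    "set":      ["X", "X", "X", "X", "Y", "Y", "Y", "Y"],
    "language": ["English", "French", "English", "French"] * 2,
    "group":    [1, 2, 1, 2] * 2,
    "version":  ["A", "A", "B", "B"] * 2,
    "metric_1": [100, 90, 80, 70, 200, 180, 160, 140],
    "metric_2": [20, 10, 30, 20, 20, 30, 10, 20],
    "metric_3": [5, 10, 15, 20, 30, 20, 10, 5],
}).set_index(index)

grand_total = df["metric_1"].sum()
for n in range(1, len(index) + 1):
    for levels in it.combinations(range(len(index)), n):
        block = df.groupby(level=list(levels)).sum()
        # Each grouping covers all rows exactly once, so totals must match.
        assert block["metric_1"].sum() == grand_total, levels

print(grand_total)  # 1020
```

On these rows the per-set sums of metric_1 are 340 for X and 680 for Y, which matches the first two lines of the output above.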

Get the sum of metrics for combinations of multiple columns
