Groupby over subsets of a category column's values

Time: 2017-08-28 14:42:34

Tags: python pandas group-by

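As the simplified example below shows, I have a DataFrame with two category columns and one value column. Both category columns can take the values "A", "B", or "C". I want to aggregate the value column over these categories, but I also want to treat the case where category column 1 is either "A" or "B" as a group of its own. In other words, I want to iterate not only over every possible value of the category columns, but also over given combinations of those values.
The best I could come up with is: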

import pandas as pd
import numpy as np
df = pd.DataFrame({'cat1':['A','A','B','B','C','C'],
              'cat2':['A','B','A','B','A','B'],
              'x':[1,3,1,3,1,3]})
res1 = df.groupby(['cat1','cat2']).sum()
res2 = df.loc[(df['cat1'] == 'A') | (df['cat1'] == 'B'),['cat2','x']].groupby('cat2').sum()
res2['cat1'] = 'AB'
res2 = res2.reset_index().set_index(['cat1','cat2'])
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result.
res = pd.concat([res1, res2])
res
                x
cat1    cat2    
A         A     1
          B     3
B         A     1
          B     3
C         A     1
          B     3
AB        A     2
          B     6

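It works, but my actual task involves more than two category columns with three possible values each, so this approach gets messy. Is there a more elegant or more efficient way to do this?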
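One possible direction, offered only as a sketch: keep the combined groups in a dictionary, copy the matching rows under the combined label, and run a single groupby over everything. The `combos` mapping and the `parts` list are names made up for this illustration and are not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({'cat1': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'cat2': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'x': [1, 3, 1, 3, 1, 3]})

# Hypothetical mapping: label of a combined group -> the cat1 values it covers.
combos = {'AB': ['A', 'B']}

parts = [df]  # the original rows provide the per-value groups
for label, values in combos.items():
    # Copy the matching rows and overwrite cat1 with the combined label.
    parts.append(df[df['cat1'].isin(values)].assign(cat1=label))

# sort=False keeps groups in order of first appearance, so 'AB' comes last.
res = pd.concat(parts).groupby(['cat1', 'cat2'], sort=False).sum()
print(res)
```

Adding another combination then only means adding an entry to `combos`. Because the combined rows are copies of existing rows, totals taken across the whole result would count those rows twice.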

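In response to the comments, here is some sample data and the logic behind it. Say I am studying the gym habits of a group of university students. The sample data (the DataFrame sss used below) is: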

    VisitTime   Collage mode of commute     TimeAtJym
0   noon-17:00  a&s     bus                 30
1   noon-17:00  a&s     car, then bus       45
2   noon-17:00  a&s     car (only)          90
3   17:00-22:00 a&s     bus                 40
4   17:00-22:00 a&s     car, then bus       50
5   17:00-22:00 a&s     car (only)          35
6   7:00-noon   a&s     bus                 55
7   7:00-noon   a&s     car, then bus       70
8   7:00-noon   a&s     car (only)          40
9   noon-17:00  law     bus                 45
10  noon-17:00  law     car, then bus       40
11  noon-17:00  law     car (only)           4
12  17:00-22:00 law     bus                 90
13  17:00-22:00 law     car, then bus       120
14  17:00-22:00 law     car (only)          30
15  7:00-noon   law     bus                 25
16  7:00-noon   law     car, then bus       90
17  7:00-noon   law     car (only)          80  

I want to know the average time spent at the gym by visit time and mode of commute, so I do:

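# Select TimeAtJym explicitly: pandas 2.0+ raises on mean() over the string column Collage.
res1 = sss.groupby(['VisitTime','mode of commute'])[['TimeAtJym']].mean()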
res1

                               TimeAtJym
VisitTime       mode of commute     
17:00-22:00     bus             65.0
                car (only)      32.5
                car, then bus   85.0
7:00-noon       bus             40.0
                car (only)      60.0
                car, then bus   80.0
noon-17:00      bus             37.5
                car (only)      47.0
                car, then bus   42.5

I also want to know the average time for the non-evening students, so I do:

# Select TimeAtJym explicitly so mean() is not applied to the string columns.
res2 = sss.loc[(sss['VisitTime'] == '7:00-noon') | (sss['VisitTime'] == 'noon-17:00')].groupby(['mode of commute'])[['TimeAtJym']].mean()
res2['VisitTime'] = 'not evening'
res2 = res2.reset_index().set_index(['VisitTime','mode of commute'])
res2

                            TimeAtJym
VisitTime   mode of commute     
not evening bus             38.75
            car (only)      53.50
            car, then bus   61.25
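For comparison, the same "not evening" table can probably be built more compactly with `Series.isin` and `DataFrame.assign`. This is a sketch that rebuilds the sample data from the table above; the helper names `colleges`, `visit_times`, `modes`, `minutes` and `not_evening` are assumptions introduced here.

```python
from itertools import product
import pandas as pd

# Rebuild the sample table: Collage varies slowest, mode of commute fastest.
colleges = ['a&s', 'law']
visit_times = ['noon-17:00', '17:00-22:00', '7:00-noon']
modes = ['bus', 'car, then bus', 'car (only)']
minutes = [30, 45, 90, 40, 50, 35, 55, 70, 40,
           45, 40, 4, 90, 120, 30, 25, 90, 80]

sss = pd.DataFrame(list(product(colleges, visit_times, modes)),
                   columns=['Collage', 'VisitTime', 'mode of commute'])
sss['TimeAtJym'] = minutes

# isin replaces the chain of "== ... | == ..." comparisons.
not_evening = ['7:00-noon', 'noon-17:00']
res2 = (sss[sss['VisitTime'].isin(not_evening)]
        .assign(VisitTime='not evening')   # relabel before grouping
        .groupby(['VisitTime', 'mode of commute'])[['TimeAtJym']]
        .mean())
print(res2)
```

Relabelling before the groupby avoids the separate reset_index / set_index step.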

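If I then consider the non-evening students who take the bus (either bus only, or car then bus), I need an even longer expression. In my actual study the number of sub-categories is large, so the approach I have taken gets messy. For the same reason, using != instead of == | == does not help. I hope this clarifies my question.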
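A sketch of how combinations on several columns might be handled in one pass: a small helper appends relabelled copies of the matching rows, and is applied once per column before a single groupby. The helper `add_combined` and the labels 'not evening' and 'any bus' are hypothetical names chosen for this example.

```python
from itertools import product
import pandas as pd

# Rebuild the sample table: Collage varies slowest, mode of commute fastest.
colleges = ['a&s', 'law']
visit_times = ['noon-17:00', '17:00-22:00', '7:00-noon']
modes = ['bus', 'car, then bus', 'car (only)']
minutes = [30, 45, 90, 40, 50, 35, 55, 70, 40,
           45, 40, 4, 90, 120, 30, 25, 90, 80]

sss = pd.DataFrame(list(product(colleges, visit_times, modes)),
                   columns=['Collage', 'VisitTime', 'mode of commute'])
sss['TimeAtJym'] = minutes


def add_combined(frame, column, combos):
    """Append relabelled copies of the rows whose `column` value is in each combo."""
    extra = [frame[frame[column].isin(values)].assign(**{column: label})
             for label, values in combos.items()]
    return pd.concat([frame, *extra], ignore_index=True)


# Apply the helper once per column; the second call also sees the rows
# added by the first, so cross combinations are produced as well.
expanded = add_combined(sss, 'VisitTime',
                        {'not evening': ['7:00-noon', 'noon-17:00']})
expanded = add_combined(expanded, 'mode of commute',
                        {'any bus': ['bus', 'car, then bus']})

res = expanded.groupby(['VisitTime', 'mode of commute'])['TimeAtJym'].mean()
print(res[('not evening', 'any bus')])
```

On the sample data, the ('not evening', 'any bus') group averages 50.0 minutes. The expanded frame contains duplicated rows by design, so it is only suitable for per-group statistics, not for totals across groups.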

0 answers:

No answers