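As the simplified example below shows, I have a DataFrame with two category columns and one value column. Both category columns can take the values "A", "B", or "C". I want to aggregate the value column over these categories, but I also want to cover the case where the first category column is either "A" or "B". In other words, I want to iterate not only over every possible value of the category columns, but also over given combinations of those values.
The best I could come up with is: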
import pandas as pd

df = pd.DataFrame({'cat1': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'cat2': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'x': [1, 3, 1, 3, 1, 3]})
# Sum x over every (cat1, cat2) pair.
res1 = df.groupby(['cat1', 'cat2']).sum()
# Sum x over cat2 for the rows where cat1 is 'A' or 'B', labelled 'AB'.
res2 = df.loc[(df['cat1'] == 'A') | (df['cat1'] == 'B'), ['cat2', 'x']].groupby('cat2').sum()
res2['cat1'] = 'AB'
res2 = res2.reset_index().set_index(['cat1', 'cat2'])
# Stack the combined rows under the per-category rows.
res = pd.concat([res1, res2])
res
           x
cat1 cat2
A    A     1
     B     3
B    A     1
     B     3
C    A     1
     B     3
AB   A     2
     B     6
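It works, but my real task involves more than two categories with three possible values each, so this approach gets messy. Is there a more elegant or more efficient way to do this?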
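To make the pattern above concrete for more than one combination, here is a minimal sketch of how it might be looped over a mapping of combined labels. The `combos` dict and the `parts` list are hypothetical names introduced only for illustration, and this is just one possible arrangement, not necessarily the most elegant one.

```python
import pandas as pd

df = pd.DataFrame({'cat1': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'cat2': ['A', 'B', 'A', 'B', 'A', 'B'],
                   'x': [1, 3, 1, 3, 1, 3]})

# Hypothetical mapping: label of each combined group -> cat1 values it pools.
combos = {'AB': ['A', 'B']}

# Start with the plain per-category sums (same as res1 above).
parts = [df.groupby(['cat1', 'cat2'])[['x']].sum()]

for label, members in combos.items():
    # Keep only the rows in this combination and relabel cat1 before grouping.
    pooled = df[df['cat1'].isin(members)].assign(cat1=label)
    parts.append(pooled.groupby(['cat1', 'cat2'])[['x']].sum())

# Stack the per-category rows and the combined rows into one result.
res = pd.concat(parts)
```

Adding another combination would then only mean adding one more entry to `combos`, although every extra category column would still need its own mapping.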
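In response to the comments, here is some sample data and the logic behind it. Say I am studying the gym habits of a group of college students. The sample data is: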
      VisitTime  Collage  mode of commute  TimeAtJym
0    noon-17:00      a&s              bus         30
1    noon-17:00      a&s    car, then bus         45
2    noon-17:00      a&s       car (only)         90
3   17:00-22:00      a&s              bus         40
4   17:00-22:00      a&s    car, then bus         50
5   17:00-22:00      a&s       car (only)         35
6     7:00-noon      a&s              bus         55
7     7:00-noon      a&s    car, then bus         70
8     7:00-noon      a&s       car (only)         40
9    noon-17:00      law              bus         45
10   noon-17:00      law    car, then bus         40
11   noon-17:00      law       car (only)          4
12  17:00-22:00      law              bus         90
13  17:00-22:00      law    car, then bus        120
14  17:00-22:00      law       car (only)         30
15    7:00-noon      law              bus         25
16    7:00-noon      law    car, then bus         90
17    7:00-noon      law       car (only)         80
I want to know the average time spent at the gym by visit time and mode of commute, so I do this:
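# Select the numeric column explicitly so the text column 'Collage' is not averaged.
res1 = sss.groupby(['VisitTime', 'mode of commute'])[['TimeAtJym']].mean()
res1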
                             TimeAtJym
VisitTime   mode of commute
17:00-22:00 bus                   65.0
            car (only)            32.5
            car, then bus         85.0
7:00-noon   bus                   40.0
            car (only)            60.0
            car, then bus         80.0
noon-17:00  bus                   37.5
            car (only)            47.0
            car, then bus         42.5
I also want to know the average time for non-evening students, so I do this:
# Keep the two non-evening time slots, then average the numeric column per commute mode.
res2 = sss.loc[(sss['VisitTime'] == '7:00-noon') | (sss['VisitTime'] == 'noon-17:00')].groupby(['mode of commute'])[['TimeAtJym']].mean()
res2['VisitTime'] = 'not evening'
res2 = res2.reset_index().set_index(['VisitTime', 'mode of commute'])
res2
                             TimeAtJym
VisitTime   mode of commute
not evening bus                  38.75
            car (only)           53.50
            car, then bus        61.25
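If I wanted to look at non-evening students who take a bus (either bus only, or car then bus), I would need a somewhat longer expression. In my actual study the number of subcategories is larger, so the approach I have taken gets messy. For the same reason, using != instead of == | == does not help. I hope this clarifies my question.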
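For concreteness, the longer expression for non-evening students who take a bus could look something like the sketch below. It uses `Series.isin` in place of chained `== | ==` comparisons; the lists `not_evening` and `uses_bus` and the name `res3` are made up for this illustration, and the frame holds only six rows copied from the sample data above.

```python
import pandas as pd

# Six rows copied from the sample data (rows 0, 1, 2, 3, 6 and 7).
sss = pd.DataFrame({
    'VisitTime': ['noon-17:00', 'noon-17:00', 'noon-17:00',
                  '17:00-22:00', '7:00-noon', '7:00-noon'],
    'mode of commute': ['bus', 'car, then bus', 'car (only)',
                        'bus', 'bus', 'car, then bus'],
    'TimeAtJym': [30, 45, 90, 40, 55, 70],
})

# Hypothetical groupings of the raw category values.
not_evening = ['7:00-noon', 'noon-17:00']
uses_bus = ['bus', 'car, then bus']

# A row qualifies when both columns fall inside their respective groupings.
mask = sss['VisitTime'].isin(not_evening) & sss['mode of commute'].isin(uses_bus)

# Average gym time over the qualifying rows only.
res3 = sss.loc[mask, 'TimeAtJym'].mean()
```

This keeps each condition to one list, but it still needs a separate mask and a separate relabelling step for every combination of interest, which is the part that gets unwieldy.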