How do I get percentages within each group of a pandas groupby?

Time: 2019-02-15 22:00:13

Tags: python pandas

I have a pandas DataFrame like this:

subject  bool   Count
1        False  329232
1        True   73896
2        False  268338
2        True   76424
3        False  186167
3        True   27078
4        False  172417
4        True   113268

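I want to convert Count into a percentage of each subject group's total. For example, row 1 would be 329232 / (329232 + 73896) ≈ 0.817, and row 2 would be 73896 / (329232 + 73896) ≈ 0.183. The total then changes for subject group 2, and so on.

Can this be done with groupby? I tried iterating over the rows, but with little success.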

3 Answers:

Answer 0: (score: 2)

This worked for me:

df['Count'] = df['Count'].div(df.groupby('subject')['Count'].transform(lambda x: x.sum()))
print(df)

Gives:

      Count   bool  subject
0  0.816693  False        1
1  0.183307   True        1
2  0.778328  False        2
3  0.221672   True        2
4  0.873019  False        3
5  0.126981   True        3
6  0.603521  False        4
7  0.396479   True        4
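
A possible variation on the same idea, shown only as a minimal sketch: passing the string 'sum' to transform instead of a lambda lets pandas use its built-in aggregation, and writing the result to a new column keeps the original counts intact. The column name pct is an assumption made for illustration, and the DataFrame is rebuilt from the question's data so the snippet runs on its own.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "subject": [1, 1, 2, 2, 3, 3, 4, 4],
    "bool": [False, True] * 4,
    "Count": [329232, 73896, 268338, 76424, 186167, 27078, 172417, 113268],
})

# transform("sum") returns each row's group total, aligned to df's index
group_total = df.groupby("subject")["Count"].transform("sum")

# "pct" is a hypothetical column name; Count is left unchanged
df["pct"] = df["Count"] / group_total

print(df)
```

Because transform returns a result with the same index as the input, the division lines up row by row without any merge or loop.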

Answer 1: (score: 1)

My solution is as follows:

Import the relevant libraries

import pandas as pd
import numpy as np

Create the DataFrame df

d = {'subject':[1,1,2,2,3,3],'bool':[False,True,False,True,False,True],
'count':[329232,73896,268338,76424,186167,27078]}
df = pd.DataFrame(d)

Use groupby and reset_index

table_sum= df.groupby('subject').sum().reset_index()[['subject','count']]

Zip the groupby output into a dictionary and use map to get the frequencies

look_1 = (dict(zip(table_sum['subject'],table_sum['count'])))
df['cu_sum'] = df['subject'].map(look_1)
df['relative_frequency'] = df['count']/df['cu_sum']

Output

print(df)

       subject   bool   count  cu_sum  relative_frequency
    0        1  False  329232  403128            0.816693
    1        1   True   73896  403128            0.183307
    2        2  False  268338  344762            0.778328
    3        2   True   76424  344762            0.221672
    4        3  False  186167  213245            0.873019
    5        3   True   27078  213245            0.126981
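
The dictionary step in this answer could probably be skipped, since Series.map also accepts a Series and looks values up by its index. The following is a minimal sketch of that shortcut, reusing the answer's data and column names; the variable name totals is an assumption introduced here.

```python
import pandas as pd

# Same data as in the answer above
df = pd.DataFrame({
    "subject": [1, 1, 2, 2, 3, 3],
    "bool": [False, True, False, True, False, True],
    "count": [329232, 73896, 268338, 76424, 186167, 27078],
})

# Series indexed by subject, holding each group's total ("totals" is a hypothetical name)
totals = df.groupby("subject")["count"].sum()

# map looks up each row's subject in the index of totals
df["cu_sum"] = df["subject"].map(totals)
df["relative_frequency"] = df["count"] / df["cu_sum"]

print(df)
```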

Answer 2: (score: -1)

import pandas as pd

# create df
d = {'subject': [1, 1, 2, 2, 3, 3, 4, 4],
     'bool': [False, True, False, True, False, True, False, True],
     'Count': [329232, 73896, 268338, 76424, 186167, 27078, 172417, 113268]}

df = pd.DataFrame(d)

#get sums for each subject group
sums = pd.DataFrame(df.groupby(['subject'])['Count'].sum().reset_index())
sums.columns = ['subject', 'sums']

#merge sums to original df
df_sums = df.merge(sums, how='left', on='subject')

#calculate percentages for each row
df_sums['percent'] = df_sums['Count']/df_sums['sums']

df_sums
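
One way to gain some confidence in the merge approach is to check that the shares in each subject group add up to 1. The sketch below repeats the steps above in compact form and adds that check; the validate argument to merge and the variable name check are additions made here, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "subject": [1, 1, 2, 2, 3, 3, 4, 4],
    "bool": [False, True] * 4,
    "Count": [329232, 73896, 268338, 76424, 186167, 27078, 172417, 113268],
})

# One row per subject with its total, renamed to match the answer's "sums" column
sums = (
    df.groupby("subject", as_index=False)["Count"]
    .sum()
    .rename(columns={"Count": "sums"})
)

# validate="many_to_one" raises an error if a subject appears more than once in sums
df_sums = df.merge(sums, how="left", on="subject", validate="many_to_one")
df_sums["percent"] = df_sums["Count"] / df_sums["sums"]

# Each group's shares should add up to 1, within floating-point tolerance
check = df_sums.groupby("subject")["percent"].sum()
assert ((check - 1.0).abs() < 1e-9).all()

print(df_sums)
```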