在pandas中用group by的百分比注释每一行?

时间:2016-07-04 11:32:06

标签: python pandas

我有一个如下所示的数据框:

Company       Speciality      Payment
AcmeCorp      Roofing         50.00
AcmeCorp      Grounding       50.00
LolCorp       Roofing         106.00
LolCorp       Grounding       94.00

我想添加一个像这样的百分比列:

Company       Speciality      Payment     Percent of Total Payment
AcmeCorp      Roofing         50.00       50
AcmeCorp      Grounding       50.00       50
LolCorp       Roofing         106.00      53
LolCorp       Grounding       94.00       47

最好的方法是什么?

我可以用这样的东西搞乱:

df_m = df.groupby('Company').sum()
final_df = pd.merge(df, df_m, on='Company', suffixes=('Raw', 'Total))
final_df['Percent of Total Payment] = final_df['Payment Raw'] / final_df['Payment_Total']

但我想知道是否有更有效的方法。

1 个答案:

答案 0 :(得分:4)

使用groupby/transform生成与原始DataFrame长度相同的列。这样可以避免调用pd.merge

import numpy as np
import pandas as pd

df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
 'Payment': [50.0, 50.0, 106, 94.00],
 'Speciality': ['Roofing', 'Grounding', 'Roofing', 'Grounding']})

total = df.groupby('Company')['Payment'].transform('sum')
df['percent'] = df['Payment']/total
print(df)

产量

    Company  Payment Speciality  percent
0  AcmeCorp     50.0    Roofing     0.50
1  AcmeCorp     50.0  Grounding     0.50
2   LolCorp    106.0    Roofing     0.53
3   LolCorp     94.0  Grounding     0.47

虽然

total = df.groupby('Company')['Payment'].transform('sum')
df['percent'] = df['Payment']/total

可以简化为单行,

df['percent'] = df.groupby('Company')['Payment'].transform(lambda x: x/x.sum())

因为像.transform('sum')这样的内置操作比自定义函数(例如.transform(lambda x: x/x.sum()))更快,所以两行版本更快(特别是对于大型DataFrame)。

当然,两行版本也可以写成

df['percent'] = df['Payment'] / df.groupby('Company')['Payment'].transform('sum')

没有速度损失,少了一个命名变量,但也许有点难以阅读。

以下是100K行DataFrame的基准测试:

In [53]: %timeit using_transform(df)
100 loops, best of 3: 8.5 ms per loop

In [54]: %timeit using_one_liner(df)
10 loops, best of 3: 20.2 ms per loop

In [55]: %timeit orig(df)
10 loops, best of 3: 30.2 ms per loop

这是用于执行基准测试的设置。

import numpy as np
import pandas as pd

N = 10**5
df = pd.DataFrame({'Company': np.random.choice(list('ABCD'), size=N),
    'Payment': np.random.randint(10, size=N),
    'Speciality': np.random.choice(list('XYZ'), size=N)})

def using_transform(df):
    total = df.groupby('Company')['Payment'].transform('sum')
    df['percent'] = df['Payment']/total
    return df

def using_one_liner(df):
    df['percent'] = df.groupby('Company')['Payment'].transform(lambda x: x/x.sum())
    return df

def orig(df):
    df_m = df.groupby('Company').sum()
    final_df = pd.merge(df, df_m, left_on='Company', right_index=True, suffixes=('_Raw', '_Total'))
    final_df['Percent of Total Payment'] = final_df['Payment_Raw'] / final_df['Payment_Total']
    return final_df