I have a DataFrame that looks like this:
Company Speciality Payment
AcmeCorp Roofing 50.00
AcmeCorp Grounding 50.00
LolCorp Roofing 106.00
LolCorp Grounding 94.00
I would like to add a percentage column like this:
Company Speciality Payment Percent of Total Payment
AcmeCorp Roofing 50.00 50
AcmeCorp Grounding 50.00 50
LolCorp Roofing 106.00 53
LolCorp Grounding 94.00 47
What is the best way to do this?
I could hack something together like this:
df_m = df.groupby('Company').sum()
final_df = pd.merge(df, df_m, on='Company', suffixes=('_Raw', '_Total'))
final_df['Percent of Total Payment'] = final_df['Payment_Raw'] / final_df['Payment_Total']
but I am wondering whether there is a more efficient way.
Answer 0 (score: 4)
Use groupby/transform to generate a column with the same length as the original DataFrame. This avoids the call to pd.merge.
import numpy as np
import pandas as pd
df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
                   'Payment': [50.0, 50.0, 106, 94.00],
                   'Speciality': ['Roofing', 'Grounding', 'Roofing', 'Grounding']})
total = df.groupby('Company')['Payment'].transform('sum')
df['percent'] = df['Payment']/total
print(df)
yields
Company Payment Speciality percent
0 AcmeCorp 50.0 Roofing 0.50
1 AcmeCorp 50.0 Grounding 0.50
2 LolCorp 106.0 Roofing 0.53
3 LolCorp 94.0 Grounding 0.47
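The output above holds fractions (0.53), whereas the question asks for values on a 0 to 100 scale (53). The following is a minimal sketch of the same groupby/transform idea with that scaling applied. It also contrasts .sum(), which returns one row per company, with .transform('sum'), which returns one row per original row. The variable names grouped, per_company and per_row are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
                   'Speciality': ['Roofing', 'Grounding', 'Roofing', 'Grounding'],
                   'Payment': [50.0, 50.0, 106.0, 94.0]})

grouped = df.groupby('Company')['Payment']

# .sum() aggregates: one value per company, indexed by company name.
per_company = grouped.sum()

# .transform('sum') broadcasts each company's total back onto its rows,
# so the result has the same index as df and can be divided directly.
per_row = grouped.transform('sum')

# Multiply by 100 to get the 0-100 scale used in the question.
df['Percent of Total Payment'] = 100 * df['Payment'] / per_row

print(len(per_company), len(per_row))
print(df)
```

Because per_row is aligned with df, no merge is needed to bring the totals back to the original rows.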
Although
total = df.groupby('Company')['Payment'].transform('sum')
df['percent'] = df['Payment']/total
can be condensed into a one-liner,
df['percent'] = df.groupby('Company')['Payment'].transform(lambda x: x/x.sum())
the two-line version is faster (especially for large DataFrames), because built-in operations such as .transform('sum') run faster than custom functions such as .transform(lambda x: x/x.sum()).
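Of course, the two-line version can also be written as
df['percent'] = df['Payment'] / df.groupby('Company')['Payment'].transform('sum')
with no loss of speed and one fewer named variable, though it is perhaps a little harder to read.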
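As a quick sanity check, the three spellings discussed here can be compared on the example data. This is only a sketch; the names two_line, one_liner and inline are illustrative and do not appear in the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Company': ['AcmeCorp', 'AcmeCorp', 'LolCorp', 'LolCorp'],
                   'Payment': [50.0, 50.0, 106.0, 94.0],
                   'Speciality': ['Roofing', 'Grounding', 'Roofing', 'Grounding']})

# Two-line version: built-in 'sum' transform, then an ordinary division.
total = df.groupby('Company')['Payment'].transform('sum')
two_line = df['Payment'] / total

# One-liner with a custom function applied to each group.
one_liner = df.groupby('Company')['Payment'].transform(lambda x: x / x.sum())

# Two-line version written inline, without the named variable.
inline = df['Payment'] / df.groupby('Company')['Payment'].transform('sum')

# All three should agree up to floating-point rounding.
print(np.allclose(two_line, one_liner), np.allclose(two_line, inline))
```

The three results differ only in how they are computed, so the choice between them comes down to speed and readability.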
Here is a benchmark on a DataFrame with 100K rows:
In [53]: %timeit using_transform(df)
100 loops, best of 3: 8.5 ms per loop
In [54]: %timeit using_one_liner(df)
10 loops, best of 3: 20.2 ms per loop
In [55]: %timeit orig(df)
10 loops, best of 3: 30.2 ms per loop
This is the setup used to run the benchmark.
import numpy as np
import pandas as pd
N = 10**5
df = pd.DataFrame({'Company': np.random.choice(list('ABCD'), size=N),
                   'Payment': np.random.randint(10, size=N),
                   'Speciality': np.random.choice(list('XYZ'), size=N)})

def using_transform(df):
    total = df.groupby('Company')['Payment'].transform('sum')
    df['percent'] = df['Payment']/total
    return df

def using_one_liner(df):
    df['percent'] = df.groupby('Company')['Payment'].transform(lambda x: x/x.sum())
    return df

def orig(df):
    df_m = df.groupby('Company').sum()
    final_df = pd.merge(df, df_m, left_on='Company', right_index=True, suffixes=('_Raw', '_Total'))
    final_df['Percent of Total Payment'] = final_df['Payment_Raw'] / final_df['Payment_Total']
    return final_df
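The timings above use IPython's %timeit magic. Outside IPython, a rough equivalent can be sketched with the standard-library timeit module. This variant is an assumption-laden simplification rather than the original setup: the functions return a Series instead of assigning a column, the Speciality column is omitted, and the data comes from a seeded NumPy generator. Absolute numbers will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

N = 10**5
rng = np.random.default_rng(0)  # seeded so the data is reproducible
df = pd.DataFrame({'Company': rng.choice(list('ABCD'), size=N),
                   'Payment': rng.integers(1, 10, size=N)})

def using_transform(df):
    # Built-in aggregation broadcast back to the rows, then one division.
    total = df.groupby('Company')['Payment'].transform('sum')
    return df['Payment'] / total

def using_one_liner(df):
    # Custom Python function called once per group.
    return df.groupby('Company')['Payment'].transform(lambda x: x / x.sum())

for func in (using_transform, using_one_liner):
    # Take the best of 3 repeats, each averaging over 10 calls.
    best = min(timeit.repeat(lambda: func(df), number=10, repeat=3)) / 10
    print(f'{func.__name__}: {best * 1000:.2f} ms per call')
```

Returning a new Series keeps the input DataFrame unchanged between runs, so every timed call operates on the same data.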