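How do I write a groupby operation that only works on subsets of rows sharing the same value in a column?
So, in the table below, I'd like to subset the rows by identical doclist value in the "Organization" column, and then add up NP and Pr only within each of those subsets.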
Organization NP Pr
0 doclist[0] 0 0
1 doclist[0] 1 0
4 doclist[1] 1 0
5 doclist[4] 1 0
6 doclist[4] 0 1
I'd like to get the result below. Should I use .apply(), or is there a better / more efficient way?
Organization NP Pr Sum
0 doclist[0] 0 0 1
1 doclist[0] 1 0 1
4 doclist[1] 1 0 1
5 doclist[4] 1 0 2
6 doclist[4] 0 1 2
Answer 0 (score: 4)
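I'd look at groupby, I think, for the "operate only on rows that have the same value in one column" part. Since it looks like you want every row to get the appropriate sum, I think you want to call .transform on that. transform "broadcasts" the grouped values back up to the full dataframe.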
df["Sum"] = (df["NP"] + df["Pr"]).groupby(df["Organization"]).transform("sum")
For example:
>>> df
Organization NP Pr
0 doclist[0] 0 0
1 doclist[0] 1 0
4 doclist[1] 1 0
5 doclist[4] 1 0
6 doclist[4] 0 1
[5 rows x 3 columns]
>>> df["Sum"] = (df["NP"] + df["Pr"]).groupby(df["Organization"]).transform("sum")
>>> df
Organization NP Pr Sum
0 doclist[0] 0 0 1
1 doclist[0] 1 0 1
4 doclist[1] 1 0 1
5 doclist[4] 1 0 2
6 doclist[4] 0 1 2
[5 rows x 4 columns]
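To make the "broadcasting" a bit more concrete, here is a minimal self-contained sketch that contrasts a plain .sum() on the groups with .transform("sum"). The frame is rebuilt from the table in the question, and the names row_totals, per_group and per_row are only illustrative, not part of the answer above.

```python
import pandas as pd

# Rebuild the question's table, including its non-contiguous index.
df = pd.DataFrame(
    {
        "Organization": ["doclist[0]", "doclist[0]", "doclist[1]", "doclist[4]", "doclist[4]"],
        "NP": [0, 1, 1, 1, 0],
        "Pr": [0, 0, 0, 0, 1],
    },
    index=[0, 1, 4, 5, 6],
)

# Per-row NP + Pr, grouped by the Organization value of each row.
row_totals = df["NP"] + df["Pr"]
grouped = row_totals.groupby(df["Organization"])

# .sum() collapses each group to one value, indexed by Organization.
per_group = grouped.sum()

# .transform("sum") returns one value per original row, keeping df's index,
# so it can be assigned straight back as a new column.
per_row = grouped.transform("sum")
df["Sum"] = per_row

print(per_group)
print(df)
```

The only difference between the two calls is the shape of the result: per_group has one entry per distinct Organization, while per_row lines up with the original rows.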
Answer 1 (score: 2)
There's probably a more efficient way (and you could write this more readably), but you can always do something like this:
import pandas as pd
org = ['doclist[0]', 'doclist[0]', 'doclist[1]', 'doclist[4]', 'doclist[4]']
np_counts = [0, 1, 1, 1, 0]
pr_counts = [0, 0, 0, 0, 1]
df = pd.DataFrame({'Organization': org, 'NP': np_counts, 'Pr': pr_counts})
# Make a "lookup" Series holding the combined NP + Pr sum for each category
# (indexed by the "Organization" value)
sums = df.groupby('Organization')[['NP', 'Pr']].sum().sum(axis=1)
# Look up the result based on the contents of each row's "Organization" column
df['Sum'] = df.apply(lambda row: sums.loc[row['Organization']], axis=1)
That's not very readable, so it may be a bit clearer written this way:
import pandas as pd
org = ['doclist[0]', 'doclist[0]', 'doclist[1]', 'doclist[4]', 'doclist[4]']
np_counts = [0, 1, 1, 1, 0]
pr_counts = [0, 0, 0, 0, 1]
df = pd.DataFrame({'Organization': org, 'NP': np_counts, 'Pr': pr_counts})
# Make a "lookup" Series holding the combined NP + Pr sum for each category
lookup = df.groupby('Organization')[['NP', 'Pr']].sum().sum(axis=1)
# Look up the result based on the contents of each row's "Organization" column
# ("lookup" is indexed by the "Organization" value)
def func(row):
    org_name = row['Organization']
    group_sum = lookup.loc[org_name]
    return group_sum
df['Sum'] = df.apply(func, axis=1)
By the way, @DSM's answer looks like the better approach.
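If you do want to keep the explicit lookup idea but avoid the row-by-row apply, one option is to map the "Organization" column through the lookup Series. This is only a sketch under the assumption that the lookup is a Series indexed by organization; the data comes from the question and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Organization": ["doclist[0]", "doclist[0]", "doclist[1]", "doclist[4]", "doclist[4]"],
        "NP": [0, 1, 1, 1, 0],
        "Pr": [0, 0, 0, 0, 1],
    }
)

# One combined NP + Pr total per organization, indexed by organization.
lookup = df.groupby("Organization")[["NP", "Pr"]].sum().sum(axis=1)

# Series.map looks each Organization value up in the index of `lookup`,
# so no Python-level function runs once per row.
df["Sum"] = df["Organization"].map(lookup)

print(df)
```

This should give the same "Sum" column as the transform("sum") version; it just keeps the intermediate per-group totals around in case they are useful elsewhere.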