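I want to compute a cumulative count for every record in a table, based on two categorical columns.
In the table below, I want to obtain the cum_count column, which is computed from the industry and deal_status columns. The idea is that, for each record, I count the number of previous won deals in the same industry.
For example, the last record in the table has cum_count = 3, because only 3 deals with deal_status = won and industry = x have been seen before it.
Pandas' GroupBy.cumcount function only does this for a single variable...
How can I do this for the case I described?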
import pandas as pd

df = pd.DataFrame({'time': [1, 2, 3, 4, 5, 6, 7],
                   'company': ["ciaA", "ciaB", "ciaA", "ciaC", "ciaA", "ciaD", "ciaE"],
                   'industry': ["x", "y", "x", "x", "x", "y", "x"],
                   'deal_status': ["won", "lost", "won", "won", "lost", "won", "lost"],
                   'cum_count': [0, 0, 1, 2, 3, 0, 3]})  # desired output
time company industry deal_status cum_count
1 ciaA x won 0
2 ciaB y lost 0
3 ciaA x won 1
4 ciaC x won 2
5 ciaA x lost 3
6 ciaD y won 0
7 ciaE x lost 3
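To make the definition concrete, here is a minimal brute-force sketch of the rule described above, assuming the rows are already sorted by time. The helper name previous_wins is hypothetical and exists only for illustration; it is not a pandas function.

```python
import pandas as pd

df = pd.DataFrame({
    "time": [1, 2, 3, 4, 5, 6, 7],
    "industry": ["x", "y", "x", "x", "x", "y", "x"],
    "deal_status": ["won", "lost", "won", "won", "lost", "won", "lost"],
})

def previous_wins(frame):
    """For each row, count the earlier rows in the same industry with deal_status == 'won'."""
    wins_so_far = {}  # industry -> number of won deals seen so far
    counts = []
    for industry, status in zip(frame["industry"], frame["deal_status"]):
        # Record the count before looking at the current row's own status.
        counts.append(wins_so_far.get(industry, 0))
        if status == "won":
            wins_so_far[industry] = wins_so_far.get(industry, 0) + 1
    return counts

df["cum_count"] = previous_wins(df)
print(df)
```

This loop reproduces the cum_count column shown in the table and can serve as a reference when checking a vectorized solution.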
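Answer 0 (score: 3)
Create a helper column that flags won deals, then take its cumulative sum within each industry. You need to shift within each group, because your count only includes the previous won deals, not the current row: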
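# 1 for a won deal, 0 otherwise
df['to_sum'] = (df.deal_status == 'won').astype(int)
# Within each industry, shift by one row so the current deal is excluded,
# then accumulate; the first row of each group becomes NaN and is filled with 0
df['cum_count'] = (df.groupby('industry')
                     .apply(lambda x: x.to_sum.shift(1).cumsum()).fillna(0)
                     .reset_index(0, drop=True))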
df
   time company industry deal_status to_sum cum_count
0 1 ciaA x won 1 0.0
1 2 ciaB y lost 0 0.0
2 3 ciaA x won 1 1.0
3 4 ciaC x won 1 2.0
4 5 ciaA x lost 0 3.0
5 6 ciaD y won 1 0.0
6 7 ciaE x lost 0 3.0
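As a possible variation on the same idea (a sketch, not part of the original answer), the shift can be replaced by subtracting the current row's flag from the grouped cumulative sum. This avoids apply and keeps the result as integers instead of floats. It assumes the rows are already in time order; the variable name won is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "time": [1, 2, 3, 4, 5, 6, 7],
    "company": ["ciaA", "ciaB", "ciaA", "ciaC", "ciaA", "ciaD", "ciaE"],
    "industry": ["x", "y", "x", "x", "x", "y", "x"],
    "deal_status": ["won", "lost", "won", "won", "lost", "won", "lost"],
})

# 1 for a won deal, 0 otherwise
won = (df["deal_status"] == "won").astype(int)

# Running total of wins per industry, including the current row...
running = won.groupby(df["industry"]).cumsum()

# ...minus the current row's own flag, leaving only the previous wins.
df["cum_count"] = running - won
print(df)
```

Because no shift is involved, no NaN values appear, so neither fillna nor a cast back to integers is needed.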