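I have duplicate customer rows with repeated statuses, because there is one row per customer subscription/product. I want to generate a new_status for each customer,
and for it to be "canceled", every one of that customer's subscription statuses must be "canceled".
I have used: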
df['duplicated'] = df.groupby('Customer', as_index=False)['Customer'].cumcount()
to number each duplicate within a customer, so that the repeated values can be told apart:
Customer | Status | new_status | duplicated
X |canceled| | 0
X |canceled| | 1
X |active | | 2
Y |canceled| | 0
A |canceled| | 0
A |canceled| | 1
B |active | | 0
B |canceled| | 1
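For anyone who wants to reproduce the setup, the table above could be built roughly as follows. This is only a sketch: the column names are taken from the table header, and the empty-string placeholder for new_status is an assumption, since the question does not say what the empty cells contain.

```python
import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame({
    "Customer": ["X", "X", "X", "Y", "A", "A", "B", "B"],
    "Status": ["canceled", "canceled", "active", "canceled",
               "canceled", "canceled", "active", "canceled"],
})

# Assumption: new_status starts out as an empty string on every row.
df["new_status"] = ""

# cumcount numbers the rows inside each customer group, starting at 0.
df["duplicated"] = df.groupby("Customer").cumcount()

print(df)
```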
So I would like to use .apply and/or .loc to produce:
Customer | Status | new_status | duplicated
X |canceled| | 0
X |canceled| | 1
X |active | | 2
Y |canceled| | 0
A |canceled| canceled | 0
A |canceled| canceled | 1
B |active | | 0
B |canceled| | 1
Answer 0 (score: 2)
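Use Series.eq to compare the Status column with == against 'canceled', then GroupBy.transform with GroupBy.all to check whether every value in each group is True. Combine that with Series.duplicated on the Customer column with keep=False, which flags every duplicated row. Finally, chain the two conditions with bitwise AND (&) and set the values with numpy.where.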
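The code that accompanied this answer is not preserved here, so the following is only a sketch of the steps it describes, run on the sample data from the question. The mask names m1 and m2 and the empty-string fallback are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Customer": ["X", "X", "X", "Y", "A", "A", "B", "B"],
    "Status": ["canceled", "canceled", "active", "canceled",
               "canceled", "canceled", "active", "canceled"],
})

# m1: True on every row of a customer whose statuses are all 'canceled'.
m1 = df["Status"].eq("canceled").groupby(df["Customer"]).transform("all")

# m2: True on every row whose customer appears more than once.
m2 = df["Customer"].duplicated(keep=False)

# Both conditions must hold; all other rows get an empty string (assumed).
df["new_status"] = np.where(m1 & m2, "canceled", "")

print(df)
```

On this sample only customer A satisfies both masks: X and B each have an active row, and Y has a single row.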
Answer 1 (score: 1)
As I understand it, you can try:
df['new_status']=(df.groupby('Customer')['Status'].
transform(lambda x: x.eq('canceled').all()).map({True:'cancelled'})).fillna(df.new_status)
print(df)
  Customer    Status new_status  duplicated
0        X  canceled                      0
1        X  canceled                      1
2        X    active                      2
3        Y  canceled  cancelled           0
4        A  canceled  cancelled           0
5        A  canceled  cancelled           1
6        B    active                      0
7        B  canceled                      1
Edit, since the expected output has changed:
df['new_status']=(df.groupby('Customer')['Status'].
transform(lambda x: x.duplicated(keep=False)&(x.eq('canceled').all()))
.map({True:'cancelled',False:''}))
print(df)
  Customer    Status new_status  duplicated
0        X  canceled                      0
1        X  canceled                      1
2        X    active                      2
3        Y  canceled                      0
4        A  canceled  cancelled           0
5        A  canceled  cancelled           1
6        B    active                      0
7        B  canceled                      1