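I have been looking for a way to drop rows from a data frame based on a condition that has to be checked against another row.
Here is my data frame: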
product product_id account_status
prod-A 100 active
prod-A 100 cancelled
prod-A 300 active
prod-A 400 cancelled
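If a row with account_status = 'active' exists for a given product and product_id combination, keep that row and drop the other rows for that combination.
The desired output is: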
product product_id account_status
prod-A 100 active
prod-A 300 active
prod-A 400 cancelled
I saw a solution mentioned here, but I could not reproduce it for string values.
Please advise.
Answer 0 (score: 3)
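IMO, groupby is not necessary here (I say this because you have tagged your question accordingly). You can use sort_values and drop_duplicates instead, exploiting the fact that "active" < "cancelled" lexicographically: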
(df.sort_values(['account_status'])
.drop_duplicates(['product', 'product_id'])
.sort_index())
product product_id account_status
0 prod-A 100 active
2 prod-A 300 active
3 prod-A 400 cancelled
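The snippet above can be run end to end as in the sketch below, which rebuilds the question's data frame by hand. The second variant is not from the answer: it swaps the string ordering for an explicit boolean rank (the name `rank` is made up for illustration), which may be safer if a status that sorts before "active" ever appears.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "product": ["prod-A"] * 4,
    "product_id": [100, 100, 300, 400],
    "account_status": ["active", "cancelled", "active", "cancelled"],
})
keys = ["product", "product_id"]

# Answer's approach: plain string order puts "active" before "cancelled",
# so the first row kept per group is the active one when it exists.
result = (
    df.sort_values("account_status")
      .drop_duplicates(keys)
      .sort_index()
)
print(result)

# Assumed variant: rank rows explicitly (False = active sorts first)
# instead of relying on the alphabetical order of the status strings.
rank = df["account_status"].ne("active")
explicit = (
    df.loc[rank.sort_values(kind="mergesort").index]
      .drop_duplicates(keys)
      .sort_index()
)
print(explicit)
```

Note that drop_duplicates keeps exactly one row per group, so a group with several non-active rows is also reduced to a single row. That matches the question's data, but not necessarily the more general cases discussed further down.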
To stay consistent with the other answer, you may also want to look at a solution based on groupby and duplicated that uses masking.
df
product product_id account_status
0 prod-A 100 active
1 prod-A 100 cancelled
2 prod-A 100 pending
3 prod-A 300 active
4 prod-A 300 pending
5 prod-A 400 cancelled
6 prod-A 500 active
7 prod-A 500 active
8 prod-A 600 pending
9 prod-A 600 cancelled
m1 = (df.assign(m=df.account_status.eq('active'))
.groupby(['product', 'product_id'])['m']
.transform('any'))
m2 = df.duplicated(['product', 'product_id'])
df[~(m1 & m2)]
product product_id account_status
0 prod-A 100 active
3 prod-A 300 active
5 prod-A 400 cancelled
6 prod-A 500 active
8 prod-A 600 pending
9 prod-A 600 cancelled
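Like the other solution, this also generalises "nicely" to multiple categories, and it only drops rows with other statuses in groups that also contain an "active" row.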
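One thing worth checking with the duplicated-based mask: duplicated flags every row after the first one in a group, so the mask keeps the active row only when it happens to come first. The sketch below uses a small made-up data frame (not from the thread) where the active row comes second, and shows one possible workaround, namely ordering active rows first before calling duplicated. The names `naive`, `order`, `dup` and `robust` are illustrative only.

```python
import pandas as pd

# Hypothetical data: in group 100 the active row is NOT the first row.
df = pd.DataFrame({
    "product": ["prod-A"] * 4,
    "product_id": [100, 100, 300, 300],
    "account_status": ["cancelled", "active", "active", "pending"],
})
keys = ["product", "product_id"]

is_active = df["account_status"].eq("active")
has_active = df.assign(m=is_active).groupby(keys)["m"].transform("any")

# Mask as written in the answer: keeps the first row of each group
# that contains an active row, whatever that first row's status is.
naive = df[~(has_active & df.duplicated(keys))]
print(naive)

# Assumed workaround: compute duplicated() on a view where active rows
# come first, then realign the flags to the original row order.
order = (~is_active).sort_values(kind="mergesort").index
dup = df.loc[order].duplicated(keys).reindex(df.index)
robust = df[~(has_active & dup)]
print(robust)
```

Like the original mask, this variant still keeps only the first active row when a group contains several.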
Answer 1 (score: 1)
For a general solution that removes only the non-active rows from each group, and only when the group contains at least one active value in account_status:
print (df)
product product_id account_status
0 prod-A 100 active
1 prod-A 100 cancelled <- should be removed
2 prod-A 300 active
3 prod-A 400 cancelled
4 prod-A 500 active
5 prod-A 500 active
6 prod-A 600 cancelled
7 prod-A 600 cancelled
s = df['account_status'].eq('active')
g = df.assign(A=s).groupby(['product','product_id'])['A']
mask = ~g.transform('any') | g.transform('all') | s
df = df[mask]
print (df)
product product_id account_status
0 prod-A 100 active
2 prod-A 300 active
3 prod-A 400 cancelled
4 prod-A 500 active
5 prod-A 500 active
6 prod-A 600 cancelled
7 prod-A 600 cancelled
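To see what each term of the mask contributes, the sketch below rebuilds the eight-row example and names the intermediate masks (the names `group_has_active`, `group_all_active`, `full_mask` and `short_mask` are mine, not the answer's). As far as I can tell, the `g.transform('all')` term is redundant: if every row in a group is active, then `s` is already True for each of those rows. The code checks that on this data rather than taking it on trust.

```python
import pandas as pd

# Eight-row example from the answer.
df = pd.DataFrame({
    "product": ["prod-A"] * 8,
    "product_id": [100, 100, 300, 400, 500, 500, 600, 600],
    "account_status": ["active", "cancelled", "active", "cancelled",
                       "active", "active", "cancelled", "cancelled"],
})

s = df["account_status"].eq("active")            # row is active
g = df.assign(A=s).groupby(["product", "product_id"])["A"]

group_has_active = g.transform("any")            # group has >= 1 active row
group_all_active = g.transform("all")            # every row in group is active

# Mask as written in the answer.
full_mask = ~group_has_active | group_all_active | s
# Shorter form: keep rows from groups without any active row, plus active rows.
short_mask = ~group_has_active | s

print(full_mask.equals(short_mask))
print(df[short_mask])
```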
It also works well with multiple categories:
print (df)
product product_id account_status
0 prod-A 100 active
1 prod-A 100 cancelled <- should be removed
2 prod-A 100 pending <- should be removed
3 prod-A 300 active
4 prod-A 300 pending <- should be removed
5 prod-A 400 cancelled
6 prod-A 500 active
7 prod-A 500 active
8 prod-A 600 pending
9 prod-A 600 cancelled
s = df['account_status'].eq('active')
g = df.assign(A=s).groupby(['product','product_id'])['A']
mask = ~g.transform('any') | g.transform('all') | s
df = df[mask]
print (df)
product product_id account_status
0 prod-A 100 active
3 prod-A 300 active
5 prod-A 400 cancelled
6 prod-A 500 active
7 prod-A 500 active
8 prod-A 600 pending
9 prod-A 600 cancelled
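If the same filtering is needed in more than one place, the mask logic could be wrapped in a small function. This is only a sketch: `drop_non_preferred` and its parameter names are hypothetical, and it assumes the temporary column name `_flag` does not clash with an existing column.

```python
import pandas as pd

def drop_non_preferred(df, keys, column, preferred):
    """Within each group of `keys`, keep only rows whose `column` equals
    `preferred` when such a row exists; leave other groups untouched."""
    is_preferred = df[column].eq(preferred)
    group_has_preferred = (
        df.assign(_flag=is_preferred)   # temporary helper column (assumed free)
          .groupby(keys)["_flag"]
          .transform("any")
    )
    return df[~group_has_preferred | is_preferred]

# Ten-row example from the answer, with three status values.
df = pd.DataFrame({
    "product": ["prod-A"] * 10,
    "product_id": [100, 100, 100, 300, 300, 400, 500, 500, 600, 600],
    "account_status": ["active", "cancelled", "pending", "active", "pending",
                       "cancelled", "active", "active", "pending", "cancelled"],
})

out = drop_non_preferred(df, ["product", "product_id"], "account_status", "active")
print(out)
```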