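Delete rows based on values in other rows

Time: 2018-12-21 06:36:28

Tags: python python-3.x pandas dataframe pandas-groupby

I have been looking for a way to drop rows from a dataframe based on a condition that has to be checked in another row.

Here is my dataframe: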

product product_id  account_status
prod-A  100         active
prod-A  100         cancelled
prod-A  300         active
prod-A  400         cancelled

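If a row with account_status = 'active' exists for a given product and product_id combination, keep that row and drop the other rows for that combination.

The desired output is: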

product product_id  account_status
prod-A  100         active
prod-A  300         active
prod-A  400         cancelled

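I saw a solution mentioned here, but could not get it to work with strings.

Please advise.

2 answers:

Answer 0: (score: 3)

IMO, groupby isn't necessary (I say this because you tagged your question accordingly). You can use sort_values and drop_duplicates, exploiting the fact that "active" < "cancelled" lexicographically, so the 'active' row sorts first and is the one drop_duplicates keeps: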

(df.sort_values(['account_status'])
   .drop_duplicates(['product', 'product_id'])
   .sort_index())

  product  product_id account_status
0  prod-A         100         active
2  prod-A         300         active
3  prod-A         400      cancelled
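
The approach above depends on 'active' sorting before every other status alphabetically. As a minimal sketch for cases where that does not hold, the ordering can be made explicit with a helper column. The column name `_priority` is hypothetical and not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["prod-A"] * 4,
    "product_id": [100, 100, 300, 400],
    "account_status": ["active", "cancelled", "active", "cancelled"],
})

# Hypothetical helper column: 0 for 'active', 1 for any other status,
# so 'active' rows sort first regardless of alphabetical order.
priority = df["account_status"].ne("active").astype(int)

result = (
    df.assign(_priority=priority)
      .sort_values("_priority", kind="mergesort")  # stable sort keeps original order within ties
      .drop_duplicates(["product", "product_id"])  # keeps the first row per group
      .drop(columns="_priority")
      .sort_index()
)
print(result)
```

Like the original one-liner, this keeps exactly one row per (product, product_id) pair, even in groups that contain no 'active' row.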

In the spirit of staying consistent with the other answer, you may also want to look at a duplicated-based solution that uses groupby and masking.

df
  product  product_id account_status
0  prod-A         100         active
1  prod-A         100      cancelled
2  prod-A         100        pending
3  prod-A         300         active
4  prod-A         300        pending
5  prod-A         400      cancelled
6  prod-A         500         active
7  prod-A         500         active
8  prod-A         600        pending
9  prod-A         600      cancelled


m1 = (df.assign(m=df.account_status.eq('active'))
        .groupby(['product', 'product_id'])['m']
        .transform('any'))
m2 = df.duplicated(['product', 'product_id'])

df[~(m1 & m2)]

  product  product_id account_status
0  prod-A         100         active
3  prod-A         300         active
5  prod-A         400      cancelled
6  prod-A         500         active
8  prod-A         600        pending
9  prod-A         600      cancelled

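Like the other solution, this also generalizes "nicely" to multiple categories, and it only drops rows in groups where 'active' is also present. Note that it keeps only the first row of such a group, so the repeated 'active' row at index 7 is dropped as well.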
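
Because `duplicated` marks every row after the first one in a group, the mask above assumes the 'active' row comes first within its group. The following is a sketch of an order-independent variant, assuming it is acceptable to reorder the rows temporarily. The names `keys`, `is_active` and `ordered` are illustrative and not from the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["prod-A"] * 4,
    "product_id": [100, 100, 600, 600],
    "account_status": ["cancelled", "active", "pending", "cancelled"],
})

keys = ["product", "product_id"]
is_active = df["account_status"].eq("active")

# Put 'active' rows first; a stable sort preserves the original order otherwise.
ordered = df.loc[(~is_active).sort_values(kind="mergesort").index]

# m1: the row belongs to a group that contains at least one 'active' row.
m1 = (ordered.assign(m=ordered["account_status"].eq("active"))
             .groupby(keys)["m"]
             .transform("any"))
# m2: the row is not the first row of its group (in the reordered frame).
m2 = ordered.duplicated(keys)

result = ordered[~(m1 & m2)].sort_index()
print(result)
```

Here the 'cancelled' row at index 0 is dropped even though it precedes the 'active' row, while the group for product_id 600 is left untouched.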

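Answer 1: (score: 1)

For a general solution that, within each group, removes only the non-active account_status rows when at least one 'active' row exists in that group: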

print (df)
  product  product_id account_status
0  prod-A         100         active
1  prod-A         100      cancelled <- should be removed
2  prod-A         300         active
3  prod-A         400      cancelled
4  prod-A         500         active
5  prod-A         500         active
6  prod-A         600      cancelled
7  prod-A         600      cancelled

s = df['account_status'].eq('active')
g = df.assign(A=s).groupby(['product','product_id'])['A']
mask = ~g.transform('any') | g.transform('all') | s
df = df[mask]
print (df)
  product  product_id account_status
0  prod-A         100         active
2  prod-A         300         active
3  prod-A         400      cancelled
4  prod-A         500         active
5  prod-A         500         active
6  prod-A         600      cancelled
7  prod-A         600      cancelled

It also works well with multiple categories:

print (df)
  product  product_id account_status
0  prod-A         100         active
1  prod-A         100      cancelled <- should be removed
2  prod-A         100        pending <- should be removed
3  prod-A         300         active
4  prod-A         300        pending <- should be removed
5  prod-A         400      cancelled
6  prod-A         500         active
7  prod-A         500         active
8  prod-A         600        pending
9  prod-A         600      cancelled

s = df['account_status'].eq('active')
g = df.assign(A=s).groupby(['product','product_id'])['A']
mask = ~g.transform('any') | g.transform('all') | s
df = df[mask]
print (df)
  product  product_id account_status
0  prod-A         100         active
3  prod-A         300         active
5  prod-A         400      cancelled
6  prod-A         500         active
7  prod-A         500         active
8  prod-A         600        pending
9  prod-A         600      cancelled
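
The mask used in this answer can be wrapped in a small reusable function, sketched below. The function name and its parameters are hypothetical. The sketch also leaves out the `g.transform('all')` term, since every row of an all-'active' group is already selected by `s`:

```python
import pandas as pd

def drop_others_when_status_exists(df, keys, status_col="account_status", keep="active"):
    """Within each group defined by `keys`, drop rows whose status differs from
    `keep`, but only if the group contains at least one `keep` row."""
    is_keep = df[status_col].eq(keep)
    # True for every row whose group contains at least one `keep` row.
    group_has_keep = is_keep.groupby([df[k] for k in keys]).transform("any")
    return df[is_keep | ~group_has_keep]

df = pd.DataFrame({
    "product": ["prod-A"] * 10,
    "product_id": [100, 100, 100, 300, 300, 400, 500, 500, 600, 600],
    "account_status": ["active", "cancelled", "pending", "active", "pending",
                       "cancelled", "active", "active", "pending", "cancelled"],
})

result = drop_others_when_status_exists(df, ["product", "product_id"])
print(result)
```

Unlike the approaches based on drop_duplicates or duplicated, this keeps repeated 'active' rows (indices 6 and 7) and leaves groups without any 'active' row unchanged.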