I need to group a pandas DataFrame and filter out duplicates based on a condition. My DataFrame looks like this:
import pandas as pd
df = pd.DataFrame({'ID':[1,1,2,2,3,4,4],'Date':['1/1/2001','1/1/1999','1/1/2010','1/1/2004','1/1/2000','1/1/2001','1/1/2000'], 'type':['yes','yes','yes','yes','no','no','no'], 'source':[3,1,1,2,2,2,1]})
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('ID')
df
Date source type
ID
1 2001-01-01 3 yes
1 1999-01-01 1 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 2 no
4 2000-01-01 1 no
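I need to group by ID and type and, wherever type == yes, keep only the most recent record, but only if that record also has the highest source. If the most recent record does not have the highest source, keep both records. Desired output: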
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 2 no
4 2000-01-01 1 no
I have tried using transform, but I can't figure out how to apply the condition:
grouped = df.groupby(['ID','type'])['Date'].transform(max)
df = df.loc[df['Date'] == grouped]
df
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
3 2000-01-01 2 no
4 2001-01-01 2 no
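Any help is much appreciated.
The problem shows up when I have a DataFrame with more rows (mine has about 70 columns and 5000 rows): the code does not take the max of source into account.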
Date source type
ID
1 2001-01-01 3 yes
1 1999-01-01 1 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 1 yes
4 2000-01-01 2 yes
Using your code, I get:
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 1 yes
It should be:
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 1 yes
4 2000-01-01 2 yes
Answer 0 (score: 2)
This needs pd.concat:
# Changed to group by 'type' only, which seems to be what you need
grouped = df.groupby(['type'])['Date'].transform(max)
s = df.loc[df['Date'] == grouped].index
# Split df into two parts: IDs that keep only their latest row, and IDs that keep every row
pd.concat([df.loc[df.index.difference(s)].sort_values('Date').groupby('ID').tail(1), df.loc[s]]).sort_index()
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 2 no
4 2000-01-01 1 no
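The one-liner above may be easier to follow when broken into named steps. This is only a sketch of the same logic on the question's first sample frame; the variable names (`latest_per_type`, `keep_all_ids`, `other_ids`, `latest_only`) are chosen here for illustration and do not appear in the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 4, 4],
    'Date': pd.to_datetime(['1/1/2001', '1/1/1999', '1/1/2010', '1/1/2004',
                            '1/1/2000', '1/1/2001', '1/1/2000']),
    'type': ['yes', 'yes', 'yes', 'yes', 'no', 'no', 'no'],
    'source': [3, 1, 1, 2, 2, 2, 1],
}).set_index('ID')

# Latest date within each type, broadcast back to every row.
latest_per_type = df.groupby('type')['Date'].transform('max')

# IDs owning a row that carries that latest date; all of their rows are kept.
keep_all_ids = df.loc[df['Date'] == latest_per_type].index.unique()

# Every other ID keeps only its most recent row.
other_ids = df.index.difference(keep_all_ids)
latest_only = df.loc[other_ids].sort_values('Date').groupby('ID').tail(1)

result = pd.concat([latest_only, df.loc[keep_all_ids]]).sort_index()
print(result)
```

Note that the split is driven by the per-type maximum, not by a per-ID comparison, so it reproduces the desired output for this sample but is tied to how the sample data happens to be laid out.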
Update
grouped = df.groupby(['type'])['source'].transform(max)
s = df.loc[df['source'] == grouped].index
pd.concat([df.loc[s].sort_values('Date').groupby('ID').tail(1),df.loc[df.index.difference(s)]]).sort_index()
Date source type
ID
1 2001-01-01 3 yes
2 2010-01-01 1 yes
2 2004-01-01 2 yes
3 2000-01-01 2 no
4 2001-01-01 1 yes
4 2000-01-01 2 yes
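For comparison, the rule as worded in the question (per ID and type, collapse a 'yes' group to its latest row only when that row also has the group's highest source) can be written directly as a boolean mask. This is a minimal sketch, not the answerer's method: `filter_groups` is a hypothetical helper name, and it assumes ID is still a regular column (call `reset_index()` first if it has already been set as the index).

```python
import pandas as pd

def filter_groups(df):
    """Hypothetical helper; expects columns ID, Date, type, source."""
    grouped = df.groupby(['ID', 'type'])
    is_latest = df['Date'] == grouped['Date'].transform('max')
    is_top = df['source'] == grouped['source'].transform('max')
    # A group collapses only if its latest row also has the top source
    # and the group's type is 'yes'.
    latest_is_top = (is_latest & is_top).groupby([df['ID'], df['type']]).transform('any')
    collapse = latest_is_top & (df['type'] == 'yes')
    # Collapsing groups keep their latest row; all other groups keep every row.
    return df[~collapse | is_latest]

# Second sample frame from the question (ID 4 has type 'yes').
df2 = pd.DataFrame({
    'ID': [1, 1, 2, 2, 3, 4, 4],
    'Date': pd.to_datetime(['1/1/2001', '1/1/1999', '1/1/2010', '1/1/2004',
                            '1/1/2000', '1/1/2001', '1/1/2000']),
    'type': ['yes', 'yes', 'yes', 'yes', 'no', 'yes', 'yes'],
    'source': [3, 1, 1, 2, 2, 1, 2],
})
out2 = filter_groups(df2)
print(out2.set_index('ID'))
```

Because the comparison is made within each (ID, type) group, the result does not depend on which ID happens to hold the overall maximum for a type. If two rows in a collapsing group share the latest date, both are kept.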