I have the following DataFrame:
time id type
2012-12-19 1 abcF1
2013-11-02 1 xF1yz
2012-12-19 1 abcF1
2012-12-18 1 abcF1
2013-11-02 1 xF1yz
2006-07-07 5 F5spo
2006-07-06 5 F5spo
2005-07-07 5 F5abc
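For a given id, I need to find the maximum date.
For that maximum date, I need to check the type.
I must then delete every row of that id whose type differs from the type at the maximum date.
Example of the target DataFrame: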
time id type
<deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
2013-11-02 1 xF1yz
<deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
<deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
2013-11-02 1 xF1yz
2006-07-07 5 F5spo
2006-07-06 5 F5spo //kept because although the date is not max, it has the same type as the row with the max date for id 5
<deleted because for id 5 the date is not the max value and the type differs from the type of the max date for id 5>
How can I achieve this? I am new to pandas and am trying to learn the right way to use the library.
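Answer 0 (score: 2)
Use DataFrameGroupBy.idxmax to get the index of the row with the maximum time for each id, select only the id and type columns from those rows, and DataFrame.merge the result back onto the original: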
df = df.merge(df.loc[df.groupby('id')['time'].idxmax(), ['id','type']])
print (df)
time id type
0 2013-11-02 1 xF1yz
1 2013-11-02 1 xF1yz
2 2006-07-07 5 F5spo
3 2006-07-06 5 F5spo
Or use DataFrame.sort_values together with DataFrame.drop_duplicates:
df = df.merge(df.sort_values('time').drop_duplicates('id', keep='last')[["id", "type"]])
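For anyone who wants to reproduce the idxmax approach end to end, here is a minimal sketch. It assumes the sample data from the question and that time has already been converted to datetime; the names latest and result are illustrative only.

```python
import pandas as pd

# Sample data from the question, with time parsed as datetime.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2012-12-19", "2013-11-02", "2012-12-19", "2012-12-18",
        "2013-11-02", "2006-07-07", "2006-07-06", "2005-07-07",
    ]),
    "id": [1, 1, 1, 1, 1, 5, 5, 5],
    "type": ["abcF1", "xF1yz", "abcF1", "abcF1",
             "xF1yz", "F5spo", "F5spo", "F5abc"],
})

# One row per id: the (id, type) pair found at that id's latest time.
latest = df.loc[df.groupby("id")["time"].idxmax(), ["id", "type"]]

# The default inner merge keeps only rows whose (id, type) pair is in latest.
result = df.merge(latest, on=["id", "type"])
print(result)
```

Note that idxmax returns the first row when several rows share the maximum time, so if one id had two different types on its latest date, only the first of them would be kept.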
Answer 1 (score: 1)
You can sort the DataFrame by time, then group by id and select the last row of each group. That is the row with the maximum date.
last_rows = df.sort_values('time').groupby('id').last()
Then merge the original DataFrame with the new one:
result = df.merge(last_rows, on=["id", "type"])
# time_x id type time_y
#0 2013-11-02 1 xF1yz 2013-11-02
#1 2013-11-02 1 xF1yz 2013-11-02
#2 2006-07-07 5 F5spo 2006-07-07
#3 2006-07-06 5 F5spo 2006-07-07
If needed, drop the duplicated last column:
result.drop('time_y', axis=1, inplace=True)
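A possible variation, sketched here under the assumption that only the type at the latest time is needed: selecting the type column before calling last() means no second time column is produced, so there is nothing to drop afterwards. The name last_types is illustrative.

```python
import pandas as pd

# Sample data from the question, with time parsed as datetime.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2012-12-19", "2013-11-02", "2012-12-19", "2012-12-18",
        "2013-11-02", "2006-07-07", "2006-07-06", "2005-07-07",
    ]),
    "id": [1, 1, 1, 1, 1, 5, 5, 5],
    "type": ["abcF1", "xF1yz", "abcF1", "abcF1",
             "xF1yz", "F5spo", "F5spo", "F5abc"],
})

# After sorting by time, the last type within each id is the type at the max date.
last_types = df.sort_values("time").groupby("id")["type"].last().reset_index()

# last_types only has the columns id and type, so the merge adds no extra columns.
result = df.merge(last_types, on=["id", "type"])
print(result)
```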
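Answer 2 (score: 1)
Use set_index, groupby and transform with idxmax to create a helper Series, then filter with boolean indexing. Because type is set as the index, idxmax returns the type found at each id's latest time:
# If necessary, cast to datetime dtype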
# df['time'] = pd.to_datetime(df['time'])
s = df.set_index('type').groupby('id')['time'].transform('idxmax')
df[df.type == s.values]
[out]
time id type
1 2013-11-02 1 xF1yz
4 2013-11-02 1 xF1yz
5 2006-07-07 5 F5spo
6 2006-07-06 5 F5spo
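The same boolean mask can also be built without putting type into the index. The following is only a sketch of that alternative, assuming the sample data from the question; type_at_max and mask are illustrative names. It maps each id to the type at its latest time and compares row by row, so alignment goes through the index rather than through .values.

```python
import pandas as pd

# Sample data from the question, with time parsed as datetime.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2012-12-19", "2013-11-02", "2012-12-19", "2012-12-18",
        "2013-11-02", "2006-07-07", "2006-07-06", "2005-07-07",
    ]),
    "id": [1, 1, 1, 1, 1, 5, 5, 5],
    "type": ["abcF1", "xF1yz", "abcF1", "abcF1",
             "xF1yz", "F5spo", "F5spo", "F5abc"],
})

# Series mapping id -> row label of that id's latest time.
latest_idx = df.groupby("id")["time"].idxmax()

# Series mapping id -> type found on that row.
type_at_max = df.loc[latest_idx].set_index("id")["type"]

# Compare each row's type with the max-date type of its own id.
mask = df["type"] == df["id"].map(type_at_max)
result = df[mask]
print(result)
```

Because this filters rather than merges, the original row labels are preserved in the result.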
Answer 3 (score: 0)
import pandas as pd

# Sample data matching the question
df = pd.DataFrame({
    'time': ['2012-12-19', '2013-11-02', '2012-12-19', '2012-12-18', '2013-11-02', '2006-07-07', '2006-07-06', '2005-07-07'],
    'id': [1,1,1,1,1,5,5,5],
    'type': ['abcF1', 'xF1yz', 'abcF1', 'abcF1', 'xF1yz', 'F5spo', 'F5spo', 'F5abc']
})
df['time'] = pd.to_datetime(df['time'])

def remove_non_max_date_ids(df):
    # Type of the row holding the latest time within this group
    max_type = df.loc[df['time'].idxmax(), 'type']
    # Keep only the rows whose type matches it
    return df[df['type'] == max_type]

df.groupby('id').apply(remove_non_max_date_ids)
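Create a helper function that filters out the rows whose type differs from the type at the maximum date, then apply it to each group of rows sharing an id.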
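Depending on the pandas version, groupby().apply may add the group key as an extra index level, emit a deprecation warning, or handle the grouping column differently. As a hedged alternative, the same per-group filter can be written as an explicit loop over the groups; this is a sketch assuming the question's sample data and unique row labels, and parts is an illustrative name.

```python
import pandas as pd

# Sample data from the question, with time parsed as datetime.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2012-12-19", "2013-11-02", "2012-12-19", "2012-12-18",
        "2013-11-02", "2006-07-07", "2006-07-06", "2005-07-07",
    ]),
    "id": [1, 1, 1, 1, 1, 5, 5, 5],
    "type": ["abcF1", "xF1yz", "abcF1", "abcF1",
             "xF1yz", "F5spo", "F5spo", "F5abc"],
})

parts = []
for _, group in df.groupby("id"):
    # Type of the row holding the latest time within this id.
    max_type = group.loc[group["time"].idxmax(), "type"]
    # Keep only the rows of this id that share that type.
    parts.append(group[group["type"] == max_type])

# Stitch the filtered groups back together; original row labels are kept.
result = pd.concat(parts)
print(result)
```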
Answer 4 (score: 0)
Another way, using duplicated.
import pandas as pd
# Cast to datetime if needed
df['time'] = pd.to_datetime(df['time'])
# Sort by id and time ascending, then flag every row except the last (latest) one per id
df = df.sort_values(by=['id','time'], ascending=[True,True])
df['time_max'] = df.duplicated(subset=['id'], keep='last')
# Keep only the latest row per id (time_max is False there) and rename its type
df2 = df.loc[~df['time_max'],['id','type']].rename(columns={'type':'type_max'}).copy()
# Merge the max-date type back onto every row of the original df
df = pd.merge(df, df2, on=['id'], how='left')
# Despite its name, for_drop is True for the rows to keep (type matches type_max)
df['for_drop'] = df['type']==df['type_max']
df = df.loc[df['for_drop'],:]
[output]:
df
time id type time_max type_max for_drop
3 2013-11-02 1 xF1yz True xF1yz True
4 2013-11-02 1 xF1yz False xF1yz True
6 2006-07-06 5 F5spo True F5spo True
7 2006-07-07 5 F5spo False F5spo True