Drop rows in pandas based on a more complex condition

Time: 2019-05-10 06:39:04

Tags: python pandas

I have the following DataFrame:

time        id  type
2012-12-19  1   abcF1
2013-11-02  1   xF1yz
2012-12-19  1   abcF1
2012-12-18  1   abcF1
2013-11-02  1   xF1yz
2006-07-07  5   F5spo
2006-07-06  5   F5spo
2005-07-07  5   F5abc

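For a given id, I need to find the maximum (latest) date.

For that maximum date, I need to check the type.

I then have to delete every row for that id whose type differs from the type at the maximum date.

Example of the target DataFrame: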

time        id  type
<deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
2013-11-02  1   xF1yz
<deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
<deleted because for id 1 the date is not the max value and the type differs from the type of the max date for id 1>
2013-11-02  1   xF1yz
2006-07-07  5   F5spo
2006-07-06  5   F5spo //kept because although the date is not max, it has the same type as the row with the max date for id 5
<deleted because for id 5 the date is not the max value and the type differs from the type of the max date for id 5>

How can I achieve this? I'm new to pandas and trying to learn the right way to use the library.

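5 answers:

Answer 0: (score: 2)

Use DataFrameGroupBy.idxmax to get the index of the row with the maximum time for each id, select only the id and type columns, and then use DataFrame.merge: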

df = df.merge(df.loc[df.groupby('id')['time'].idxmax(), ['id','type']])
print (df)
        time  id   type
0 2013-11-02   1  xF1yz
1 2013-11-02   1  xF1yz
2 2006-07-07   5  F5spo
3 2006-07-06   5  F5spo

Or use DataFrame.sort_values together with DataFrame.drop_duplicates:

df = df.merge(df.sort_values('time').drop_duplicates('id', keep='last')[["id", "type"]])
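
As a rough, self-contained sketch of what the merge is doing, the lookup frame can be built separately first. The names `latest` and `result` are only illustrative, and `time` is assumed to be parsed as datetime:

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2012-12-19", "2013-11-02", "2012-12-19", "2012-12-18",
             "2013-11-02", "2006-07-07", "2006-07-06", "2005-07-07"],
    "id": [1, 1, 1, 1, 1, 5, 5, 5],
    "type": ["abcF1", "xF1yz", "abcF1", "abcF1",
             "xF1yz", "F5spo", "F5spo", "F5abc"],
})
df["time"] = pd.to_datetime(df["time"])

# One row per id: the (id, type) pair found at that id's latest time.
latest = df.loc[df.groupby("id")["time"].idxmax(), ["id", "type"]]

# With no `on` argument, merge joins on all shared columns (id and type here);
# the default inner join drops rows that have no match in `latest`.
result = df.merge(latest)
print(result)
```

Note that idxmax returns the first row when several rows share the latest date, so if those rows had different types, only the first one's type would be kept.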

Answer 1: (score: 1)

You can sort the DataFrame by time, then group by id and select the last row in each group. That is the row with the maximum date.

last_rows = df.sort_values('time').groupby('id').last()

Then merge the original DataFrame with the new one:

result = df.merge(last_rows, on=["id", "type"])
#       time_x  id   type      time_y
#0  2013-11-02   1  xF1yz  2013-11-02
#1  2013-11-02   1  xF1yz  2013-11-02
#2  2006-07-07   5  F5spo  2006-07-07
#3  2006-07-06   5  F5spo  2006-07-07

If needed, drop the duplicated time column at the end:

result.drop('time_y', axis=1, inplace=True)
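
If you'd rather not create `time_y` in the first place, one option is to take only the type column before merging. This is just a sketch, `last_types` is a made-up name, and `time` is assumed to be datetime:

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2012-12-19", "2013-11-02", "2012-12-19", "2012-12-18",
             "2013-11-02", "2006-07-07", "2006-07-06", "2005-07-07"],
    "id": [1, 1, 1, 1, 1, 5, 5, 5],
    "type": ["abcF1", "xF1yz", "abcF1", "abcF1",
             "xF1yz", "F5spo", "F5spo", "F5abc"],
})
df["time"] = pd.to_datetime(df["time"])

# Type of the chronologically last row per id, as a two-column frame (id, type).
last_types = (
    df.sort_values("time")
      .groupby("id")["type"]
      .last()
      .reset_index()
)

# No time column on the right side, so no time_x / time_y suffixes appear.
result = df.merge(last_types, on=["id", "type"])
print(result)
```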

Answer 2: (score: 1)

Use set_index, groupby and transform with idxmax to create a helper Series, then use boolean indexing:

# If necessary, cast to datetime dtype
# df['time'] = pd.to_datetime(df['time'])

s = df.set_index('type').groupby('id')['time'].transform('idxmax')
df[df.type == s.values]

[out]

        time  id   type
1 2013-11-02   1  xF1yz
4 2013-11-02   1  xF1yz
5 2006-07-07   5  F5spo
6 2006-07-06   5  F5spo
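
A variation on the same idea that relies on label alignment instead of `.values` is sketched below. Because type is the index, idxmax hands back the type label of the latest row per id, which can then be mapped onto the id column. The names `latest_type` and `mask` are only illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2012-12-19", "2013-11-02", "2012-12-19", "2012-12-18",
             "2013-11-02", "2006-07-07", "2006-07-06", "2005-07-07"],
    "id": [1, 1, 1, 1, 1, 5, 5, 5],
    "type": ["abcF1", "xF1yz", "abcF1", "abcF1",
             "xF1yz", "F5spo", "F5spo", "F5abc"],
})
df["time"] = pd.to_datetime(df["time"])

# Series indexed by id; each value is the type label of that id's latest row.
latest_type = df.set_index("type").groupby("id")["time"].idxmax()

# Look up the latest type for every row's id and compare it with the row's own type.
mask = df["type"] == df["id"].map(latest_type)
result = df[mask]
print(result)
```

The original row labels are preserved here, since nothing is merged or re-indexed.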

Answer 3: (score: 0)

import pandas as pd

df = pd.DataFrame({
    'time': ['2012-12-19', '2013-11-02', '2012-12-19', '2012-12-18', '2013-11-02', '2006-07-07', '2006-07-06', '2005-07-07'],
    'id': [1,1,1,1,1,5,5,5],
    'type': ['abcF1', 'xF1yz', 'abcF1', 'abcF1', 'xF1yz', 'F5spo', 'F5spo', 'F5abc']
})

df['time'] = pd.to_datetime(df['time'])

def remove_non_max_date_ids(df):
    # type of the row holding the latest date in this group
    max_type = df.loc[df['time'].idxmax(), 'type']
    # keep only the rows that share that type
    return df[df['type'] == max_type]

df.groupby('id').apply(remove_non_max_date_ids)

Create a helper function that keeps only the rows whose type matches the type at the maximum date, then apply it to each id group of df.
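
The result of groupby plus apply typically comes back with id added as an extra index level, and how apply treats the grouping column has changed between pandas versions. As a hedged alternative, here is a sketch that loops over the groups explicitly and concatenates the pieces instead of calling apply. The helper name `keep_latest_type` is hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2012-12-19", "2013-11-02", "2012-12-19", "2012-12-18",
             "2013-11-02", "2006-07-07", "2006-07-06", "2005-07-07"],
    "id": [1, 1, 1, 1, 1, 5, 5, 5],
    "type": ["abcF1", "xF1yz", "abcF1", "abcF1",
             "xF1yz", "F5spo", "F5spo", "F5abc"],
})
df["time"] = pd.to_datetime(df["time"])


def keep_latest_type(group):
    """Return the rows of one id group whose type matches the latest row's type."""
    latest_type = group.loc[group["time"].idxmax(), "type"]
    return group[group["type"] == latest_type]


# Iterate over the (id, sub-frame) pairs and glue the filtered pieces together;
# the original row labels and all three columns are preserved.
result = pd.concat(keep_latest_type(group) for _, group in df.groupby("id"))
print(result)
```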

Answer 4: (score: 0)

Another way, using duplicated.

import pandas as pd

# if needed
df['time'] = pd.to_datetime(df['time'])

# sort by id and time in ascending order, then flag the duplicates;
# with keep='last', only the latest row per id gets time_max == False
df = df.sort_values(by=['id','time'], ascending=[True,True])
df['time_max'] = df.duplicated(subset=['id'], keep='last')
# keep only the row with the max time per id
df2 = df.loc[~df['time_max'],['id','type']].rename(columns={'type':'type_max'}).copy()

# merge with the original df
df = pd.merge(df, df2, on=['id'], how='left')
# keep the rows whose type matches the type at the max time
df['for_drop'] = df['type']==df['type_max']
df = df.loc[df['for_drop'],:]

[out]:

df
    time        id  type    time_max    type_max    for_drop
3   2013-11-02  1   xF1yz   True          xF1yz       True
4   2013-11-02  1   xF1yz   False         xF1yz       True
6   2006-07-06  5   F5spo   True          F5spo       True
7   2006-07-07  5   F5spo   False         F5spo       True
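
The helper columns in that output are only scaffolding. A sketch of the same duplicated-based idea that never adds them to the frame could look like this, where `ordered`, `is_latest` and `latest` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2012-12-19", "2013-11-02", "2012-12-19", "2012-12-18",
             "2013-11-02", "2006-07-07", "2006-07-06", "2005-07-07"],
    "id": [1, 1, 1, 1, 1, 5, 5, 5],
    "type": ["abcF1", "xF1yz", "abcF1", "abcF1",
             "xF1yz", "F5spo", "F5spo", "F5abc"],
})
df["time"] = pd.to_datetime(df["time"])

# Sort so that the latest row of each id comes last within its id.
ordered = df.sort_values(["id", "time"])

# duplicated(keep="last") is False only for the last row per id, so negate it.
is_latest = ~ordered.duplicated(subset="id", keep="last")
latest = ordered.loc[is_latest, ["id", "type"]]

# An inner merge on both columns keeps only rows whose type matches the latest type.
result = ordered.merge(latest, on=["id", "type"])
print(result)
```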