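How to remove duplicates while keeping only the first occurrence, but for just one category, in pandas

Time: 2019-07-19 09:13:02

Tags: python pandas dataframe duplicates

How can I remove duplicates in pandas, keeping only the first occurrence, but for just one category?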

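The event_name column has two categories, process_now and fast_order, and the de-duplication has some special rules:
1. Only duplicates in the fast_order category are removed.
2. If fast_order occurs several times in a row, only one row is kept per consecutive run (not one per User_id).
3. The row that is kept is the first occurrence in the run.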

Data

User_id   event_name        timestamp
1         process_now       08:00:01
1         process_now       08:00:02
1         process_now       08:00:03
1         fast_order        08:00:04
1         fast_order        08:00:05
1         process_now       08:00:06
2         process_now       08:00:01
2         process_now       08:00:02
2         fast_order        08:00:03
2         fast_order        08:00:04
2         fast_order        08:00:05
2         process_now       08:00:06
2         fast_order        08:00:07
2         fast_order        08:00:08
2         process_now       08:00:09

The output I need is:

User_id   event_name        timestamp
1         process_now       08:00:01
1         process_now       08:00:02
1         process_now       08:00:03
1         fast_order        08:00:04
1         process_now       08:00:06
2         process_now       08:00:01
2         process_now       08:00:02
2         fast_order        08:00:03
2         process_now       08:00:06
2         fast_order        08:00:07
2         process_now       08:00:09

How should I do this?

1 Answer:

Answer 0: (score: 2)

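Build a helper Series g that labels consecutive runs of event_name, then use DataFrame.duplicated on the two columns User_id and g to flag every row after the first in each run. Invert that mask with ~ and chain it, via bitwise OR (|), with a condition that is True wherever event_name is not equal to fast_order, so that only repeated fast_order rows are dropped: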

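# Label consecutive runs: the counter goes up whenever event_name differs from the previous row
g = df['event_name'].ne(df['event_name'].shift()).cumsum()
# Keep every non-fast_order row, plus the first row of each (User_id, g) pair
df = df[df['event_name'].ne('fast_order') | ~df.assign(g=g).duplicated(['User_id','g'])]
print (df)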
    User_id   event_name timestamp
0         1  process_now  08:00:01
1         1  process_now  08:00:02
2         1  process_now  08:00:03
3         1   fast_order  08:00:04
5         1  process_now  08:00:06
6         2  process_now  08:00:01
7         2  process_now  08:00:02
8         2   fast_order  08:00:03
11        2  process_now  08:00:06
12        2   fast_order  08:00:07
14        2  process_now  08:00:09
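
For anyone who wants to run the answer end to end, here is a minimal, self-contained sketch. The DataFrame is rebuilt from the data in the question; the function name drop_consecutive_fast_orders and its category parameter are hypothetical wrappers added for illustration and are not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "User_id": [1] * 6 + [2] * 9,
    "event_name": [
        "process_now", "process_now", "process_now",
        "fast_order", "fast_order", "process_now",
        "process_now", "process_now",
        "fast_order", "fast_order", "fast_order",
        "process_now", "fast_order", "fast_order", "process_now",
    ],
    "timestamp": [
        "08:00:01", "08:00:02", "08:00:03", "08:00:04", "08:00:05", "08:00:06",
        "08:00:01", "08:00:02", "08:00:03", "08:00:04", "08:00:05", "08:00:06",
        "08:00:07", "08:00:08", "08:00:09",
    ],
})


def drop_consecutive_fast_orders(frame, category="fast_order"):
    """Hypothetical helper: keep only the first row of each consecutive run of `category`."""
    # Run label: increases whenever event_name differs from the previous row.
    g = frame["event_name"].ne(frame["event_name"].shift()).cumsum()
    # True for every row after the first within a (User_id, run) pair.
    is_repeat = frame.assign(g=g).duplicated(["User_id", "g"])
    # Rows of other categories are always kept; `category` rows only if not a repeat.
    keep = frame["event_name"].ne(category) | ~is_repeat
    return frame[keep]


result = drop_consecutive_fast_orders(df)
print(result)
```

The surviving index labels should be 0, 1, 2, 3, 5, 6, 7, 8, 11, 12, 14, matching the output printed in the answer.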

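Details (these were printed after df had already been filtered above, so rows 4, 9, 10 and 13 are no longer present):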

print (df.assign(g=g))
    User_id   event_name timestamp  g
0         1  process_now  08:00:01  1
1         1  process_now  08:00:02  1
2         1  process_now  08:00:03  1
3         1   fast_order  08:00:04  2
5         1  process_now  08:00:06  3
6         2  process_now  08:00:01  3
7         2  process_now  08:00:02  3
8         2   fast_order  08:00:03  4
11        2  process_now  08:00:06  5
12        2   fast_order  08:00:07  6
14        2  process_now  08:00:09  7
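
The run-labelling idiom behind g can be easier to follow in isolation. The following is a small sketch on a made-up Series (the letters are illustrative values, not data from the question): comparing each value with its predecessor marks the start of every run, and the cumulative sum turns those marks into run numbers.

```python
import pandas as pd

# Illustrative values only (not from the question).
s = pd.Series(["a", "a", "b", "b", "a"])

# True where the value differs from the previous row (the first row compares against NaN).
starts_new_run = s.ne(s.shift())

# Cumulative sum of the booleans gives one integer label per consecutive run.
run_id = starts_new_run.cumsum()

print(run_id.tolist())  # [1, 1, 2, 2, 3]
```

Note that the second run of "a" gets a new label (3) rather than reusing 1, which is what makes the labels describe consecutive runs instead of plain value groups.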

print (df.assign(g=g).duplicated(['User_id','g']))
0     False
1      True
2      True
3     False
5     False
6     False
7      True
8     False
11    False
12    False
14    False
dtype: bool

print (~df.assign(g=g).duplicated(['User_id','g']))
0      True
1     False
2     False
3      True
5      True
6      True
7     False
8      True
11     True
12     True
14     True
dtype: bool