How to flag the last duplicate element in a pandas DataFrame

Asked: 2019-04-10 08:44:10

Tags: python pandas

As you know, the .duplicated method can find duplicates in a column, but what I need is the last duplicated element, given that my data is sorted by date.

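The expected result is a column Last_dup, derived from the column Policy_id, that is 1 on the last row of each Policy_id group and 0 on all other rows.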

Thanks in advance for your help and support!

2 Answers:

Answer 0 (score: 2)

Use Series.duplicated or DataFrame.duplicated with the column specified and the parameter keep='last', then invert the mask and cast it to integer to map True/False to 1/0, or use numpy.where:

import numpy as np
df['Last_dup1'] = (~df['Policy_id'].duplicated(keep='last')).astype(int)
# equivalent, using numpy.where
df['Last_dup1'] = np.where(df['Policy_id'].duplicated(keep='last'), 0, 1)

Or:

df['Last_dup1'] = (~df.duplicated(subset=['Policy_id'], keep='last')).astype(int)
df['Last_dup1'] = np.where(df.duplicated(subset=['Policy_id'], keep='last'), 0, 1)

print(df)
   Id Policy_id  Start_Date  Last_dup  Last_dup1
0   0      b123  2019/02/24         0          0
1   1      b123  2019/03/24         0          0
2   2      b123  2019/04/24         1          1
3   3      c123  2018/09/01         0          0
4   4      c123  2018/10/01         1          1
5   5      d123  2017/02/24         0          0
6   6      d123  2017/03/24         1          1
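
For anyone who wants to run the approach above end to end, here is a minimal self-contained sketch. The sample frame is rebuilt from the printed output, and the names has_later_duplicate and via_where are only illustrative. It assumes, as the question states, that rows are already ordered by date within each Policy_id.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the printed output above.
df = pd.DataFrame({
    "Id": [0, 1, 2, 3, 4, 5, 6],
    "Policy_id": ["b123", "b123", "b123", "c123", "c123", "d123", "d123"],
    "Start_Date": ["2019/02/24", "2019/03/24", "2019/04/24",
                   "2018/09/01", "2018/10/01", "2017/02/24", "2017/03/24"],
})

# True for every row that is followed by a later row with the same Policy_id.
has_later_duplicate = df["Policy_id"].duplicated(keep="last")

# Invert: the last row of each group is the only one without a later duplicate.
df["Last_dup1"] = (~has_later_duplicate).astype(int)

# Same flags computed through numpy.where.
via_where = np.where(has_later_duplicate, 0, 1)

print(df)
```

Note that a Policy_id occurring only once would also be flagged with 1, since it has no later duplicate; whether that is desired depends on the use case.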

Answer 1 (score: 0)

It can also be done in the way shown below (without using Series.duplicated):

# Later rows overwrite earlier ones, so each value is the most recent Id for its Policy_id
dictionary = df[['Id','Policy_id']].set_index('Policy_id').to_dict()['Id']
last_ids = set(dictionary.values())  # build the lookup once instead of once per row
df['Last_dup'] = df['Id'].apply(lambda x: 1 if x in last_ids else 0)
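
As a rough sketch of why the dictionary approach gives the same flags, the snippet below rebuilds the sample data and compares both results. The names last_id_per_policy, expected and same are hypothetical, and the approach assumes that Id values are unique across the frame; if they are not, a row could be flagged merely because it shares an Id with the last row of another group.

```python
import pandas as pd

# Sample data matching the Id and Policy_id columns shown in the first answer.
df = pd.DataFrame({
    "Id": [0, 1, 2, 3, 4, 5, 6],
    "Policy_id": ["b123", "b123", "b123", "c123", "c123", "d123", "d123"],
})

# A dict built from (Policy_id, Id) pairs keeps the last Id seen for each key.
last_id_per_policy = dict(zip(df["Policy_id"], df["Id"]))

# Flag rows whose Id is one of those last Ids (assumes Id values are unique).
df["Last_dup"] = df["Id"].isin(list(last_id_per_policy.values())).astype(int)

# Cross-check against the duplicated(keep='last') approach.
expected = (~df["Policy_id"].duplicated(keep="last")).astype(int)
same = bool((df["Last_dup"] == expected).all())
print(same)
```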