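Flag duplicates based on the time difference between consecutive rows

Time: 2019-06-09 17:46:19

Tags: python pandas dataframe

A bank DataFrame (DF) contains duplicate transactions. ID is the customer ID. A duplicate transaction is a multi-swipe, where a vendor accidentally charges a customer's card multiple times within a short time span (2 minutes here).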

import pandas as pd

DF = pd.DataFrame({
    'ID': ['111', '111', '111', '111', '222', '222', '222', '333', '333', '333', '333', '111'],
    'Dollar': [1, 3, 1, 10, 25, 8, 25, 9, 20, 9, 9, 10],
    'transactionDateTime': ['2016-01-08 19:04:50', '2016-01-29 19:03:55', '2016-01-08 19:05:50',
                            '2016-01-08 20:08:50', '2016-01-08 19:04:50', '2016-02-08 19:04:50',
                            '2016-03-08 19:04:50', '2016-01-08 19:04:50', '2016-03-08 19:05:53',
                            '2016-01-08 19:03:20', '2016-01-08 19:02:15', '2016-02-08 20:08:50'],
})
# Parse the strings into datetimes so that time differences can be computed
DF['transactionDateTime'] = pd.to_datetime(DF['transactionDateTime'])

    ID  Dollar  transactionDateTime
0   111     1   2016-01-08 19:04:50
1   111     3   2016-01-29 19:03:55
2   111     1   2016-01-08 19:05:50
3   111     10  2016-01-08 20:08:50
4   222     25  2016-01-08 19:04:50
5   222     8   2016-02-08 19:04:50
6   222     25  2016-03-08 19:04:50
7   333     9   2016-01-08 19:04:50
8   333     20  2016-03-08 19:05:53
9   333     9   2016-01-08 19:03:20
10  333     9   2016-01-08 19:02:15
11  111     10  2016-02-08 20:08:50

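I want to add a column to the DataFrame that identifies duplicate transactions: for the same customer ID, the dollar amount should be the same and the transaction datetimes should be less than 2 minutes apart. Please treat the first transaction as the "normal" one.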

    ID  Dollar  transactionDateTime     Duplicated?
0   111     1   2016-01-08 19:04:50     No
1   111     3   2016-01-29 19:03:55     No
2   111     1   2016-01-08 19:05:50     Yes
3   111     10  2016-01-08 20:08:50     No
4   222     25  2016-01-08 19:04:50     No
5   222     8   2016-02-08 19:04:50     No
6   222     25  2016-03-08 19:04:50     No
7   333     9   2016-01-08 19:04:50     Yes
8   333     20  2016-03-08 19:05:53     No
9   333     9   2016-01-08 19:03:20     Yes
10  333     9   2016-01-08 19:02:15     No
11  111     10  2016-02-08 20:08:50     No

4 answers:

Answer 0 (score: 3)

IIUC, you can use groupby and diff to check whether the difference between consecutive transactions is less than 120 seconds:

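# Sort by time, then measure the gap to the previous transaction
# that has the same customer ID and the same amount
DF['Duplicated?'] = (DF.sort_values(['transactionDateTime'])
                       .groupby(['ID', 'Dollar'], sort=False)['transactionDateTime']
                       .diff()
                       .dt.total_seconds()
                       .lt(120))
DF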

     ID  Dollar transactionDateTime  Duplicated?
0   111       1 2016-01-08 19:04:50        False
1   111       3 2016-01-29 19:03:55        False
2   111       1 2016-01-08 19:05:50         True
3   111     100 2016-01-08 20:08:50        False
4   222      25 2016-01-08 19:04:50        False
5   222       8 2016-02-08 19:04:50        False
6   222      25 2016-03-08 19:04:50        False
7   333       9 2016-01-08 19:04:50         True
8   333      20 2016-03-08 19:05:53        False
9   333       9 2016-01-08 19:03:20         True
10  333       9 2016-01-08 19:02:15        False
11  111     100 2016-02-08 20:08:50        False

Note that your data is not sorted, so it must be sorted first to get meaningful results.
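As a self-contained sketch of the approach above, the snippet below rebuilds the question's data and returns 'Yes'/'No' labels instead of booleans. The helper name `flag_duplicates` and its `window` parameter are illustrative assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame({
    'ID': ['111', '111', '111', '111', '222', '222', '222', '333', '333', '333', '333', '111'],
    'Dollar': [1, 3, 1, 10, 25, 8, 25, 9, 20, 9, 9, 10],
    'transactionDateTime': pd.to_datetime([
        '2016-01-08 19:04:50', '2016-01-29 19:03:55', '2016-01-08 19:05:50',
        '2016-01-08 20:08:50', '2016-01-08 19:04:50', '2016-02-08 19:04:50',
        '2016-03-08 19:04:50', '2016-01-08 19:04:50', '2016-03-08 19:05:53',
        '2016-01-08 19:03:20', '2016-01-08 19:02:15', '2016-02-08 20:08:50']),
})


def flag_duplicates(frame, window=pd.Timedelta(minutes=2)):
    """Hypothetical helper: label rows that follow an identical charge within `window`."""
    # Gap to the previous transaction with the same customer and amount, in time order
    gaps = (frame.sort_values('transactionDateTime')
                 .groupby(['ID', 'Dollar'], sort=False)['transactionDateTime']
                 .diff())
    # The first row of each group has no predecessor (NaT), which compares as False
    is_dup = gaps.lt(window).reindex(frame.index)
    return pd.Series(np.where(is_dup, 'Yes', 'No'), index=frame.index, name='Duplicated?')


DF['Duplicated?'] = flag_duplicates(DF)
print(DF)
```

Because the result is realigned to the original index, the rows of DF stay in their original order even though the gaps are computed on a sorted copy.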

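Answer 1 (score: 2)

You can use the following (the code assumes the customer column is named customerID rather than ID):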

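import numpy as np

# Gap to the previous row of the same customer, in minutes, compared with 2
m = (DF.groupby('customerID')['transactionDateTime'].diff() / np.timedelta64(1, 'm')).le(2)
DF['Duplicated?'] = np.where(DF.Dollar.duplicated() & m, 'Yes', 'No')
print(DF)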

   customerID  Dollar transactionDateTime Duplicated?
0         111       1 2016-01-08 19:04:50          No
1         111       3 2016-01-29 19:03:55          No
2         111       1 2016-01-08 19:05:50         Yes
3         111     100 2016-01-08 20:08:50          No
4         222      25 2016-01-08 19:04:50          No
5         222       8 2016-02-08 19:04:50          No
6         222      25 2016-03-08 19:04:50          No
7         333       9 2016-01-08 19:04:50          No
8         333      20 2016-03-08 19:05:53          No
9         333       9 2016-01-08 19:03:20         Yes
10        333       9 2016-01-08 19:02:15         Yes
11        111     100 2016-02-08 20:08:50          No
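The output above flags row 10 rather than row 7, which differs from the expected result in the question. One likely reason is that `diff()` on unsorted timestamps produces negative gaps, and any negative gap passes a "at most 2 minutes" test. The sketch below illustrates this on the three 9-dollar rows of customer 333 only; the variable names are illustrative.

```python
import pandas as pd

# The three $9 transactions of customer 333, in their original (unsorted) row order
times = pd.Series(pd.to_datetime([
    '2016-01-08 19:04:50',   # row 7
    '2016-01-08 19:03:20',   # row 9
    '2016-01-08 19:02:15',   # row 10
]), index=[7, 9, 10])

two_min = pd.Timedelta(minutes=2)

# Without sorting, the gaps are negative (-90 s, -65 s) and still satisfy "<= 2 min"
unsorted_flags = times.diff().le(two_min)

# Sorting first gives positive gaps (65 s, 90 s); reindex restores the original row order
sorted_flags = times.sort_values().diff().le(two_min).reindex(times.index)

print(unsorted_flags.tolist())  # [False, True, True]  -> rows 9 and 10 flagged
print(sorted_flags.tolist())    # [True, True, False]  -> rows 7 and 9 flagged
```

Sorting by transactionDateTime before calling `diff()` makes the earliest charge the unflagged "normal" one, as the question requests.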

Answer 2 (score: 1)

We can first mark the duplicate payments in your Dollar column. Then, for each customer, mark whether the difference is at most 2 minutes:

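import numpy as np

# Sort by customer and time; reset the index so it runs 0..11 as in the output below
DF = DF.sort_values(['customerID', 'transactionDateTime']).reset_index(drop=True)

# group_keys=False keeps the original row index on the result in recent pandas versions
m1 = DF.groupby('customerID', sort=False, group_keys=False)['Dollar'].apply(lambda x: x.duplicated())
m2 = DF.groupby('customerID', sort=False)['transactionDateTime'].diff() <= pd.Timedelta(2, unit='minutes')

DF['Duplicated?'] = np.where(m1 & m2, 'Yes', 'No')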

   customerID  Dollar transactionDateTime Duplicated?
0         111       1 2016-01-08 19:04:50          No
1         111       1 2016-01-08 19:05:50         Yes
2         111     100 2016-01-08 20:08:50          No
3         111       3 2016-01-29 19:03:55          No
4         111     100 2016-02-08 20:08:50          No
5         222      25 2016-01-08 19:04:50          No
6         222       8 2016-02-08 19:04:50          No
7         222      25 2016-03-08 19:04:50          No
8         333       9 2016-01-08 19:02:15          No
9         333       9 2016-01-08 19:03:20         Yes
10        333       9 2016-01-08 19:04:50         Yes
11        333      20 2016-03-08 19:05:53          No

Answer 3 (score: 1)

I used pd.Timedelta(minutes=2) together with diff():

m2 = pd.Timedelta(minutes=2)
DF['dup'] = (DF.sort_values('transactionDateTime')
               .groupby(['Dollar', 'ID'])
               .transactionDateTime
               .diff()
               .abs()
               .le(m2)
               .astype(int))

Out[272]:
    Dollar   ID transactionDateTime  dup
0        1  111 2016-01-08 19:04:50    0
1        3  111 2016-01-29 19:03:55    0
2        1  111 2016-01-08 19:05:50    1
3      100  111 2016-01-08 20:08:50    0
4       25  222 2016-01-08 19:04:50    0
5        8  222 2016-02-08 19:04:50    0
6       25  222 2016-03-08 19:04:50    0
7        9  333 2016-01-08 19:04:50    1
8       20  333 2016-03-08 19:05:53    0
9        9  333 2016-01-08 19:03:20    1
10       9  333 2016-01-08 19:02:15    0
11     100  111 2016-02-08 20:08:50    0
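The answers treat a gap of exactly two minutes differently: Answer 0 uses `lt` (strictly less than), while this one uses `le` (less than or equal). The question asks for gaps of less than 2 minutes. A minimal sketch of the boundary case, using made-up timestamps that are not in the question's data:

```python
import pandas as pd

# Hypothetical timestamps exactly 120 seconds apart (not from the question's data)
ts = pd.Series(pd.to_datetime(['2016-01-08 19:00:00', '2016-01-08 19:02:00']))
window = pd.Timedelta(minutes=2)

gap = ts.diff()  # NaT for the first row, 2 minutes for the second

strict = gap.lt(window).tolist()     # second row is NOT flagged
inclusive = gap.le(window).tolist()  # second row IS flagged

print(strict)     # [False, False]
print(inclusive)  # [False, True]
```

None of the gaps in the sample data is exactly 120 seconds, so both comparisons give the same flags there; the choice only matters for data that hits the boundary.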