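I am trying to match rows in two data frames when two criteria are each met within a tolerance. Can this be done with pandas merge_asof? And is it possible to do a left merge without using a row from the right data frame more than once?
I have been able to match rows whose date_time values are less than 10 minutes apart by sorting on the date_time field and using it as the index. That works so far, but I don't know how to add a second criterion with its own tolerance. I also haven't figured out how to avoid merging the same right data frame row more than once.
Any help is appreciated.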
import pandas as pd
from pandas import read_csv
from io import StringIO
### DATA FRAME INPUTS ###
a = '''
date_time,phone,duration
01/01/2016 02:03:00,2065539023,1
01/01/2016 11:01:00,2065539023,21
01/01/2016 11:02:00,2065539023,27
01/01/2016 14:02:00,2065539030,5
'''
b = '''
date_time,phone,duration
01/01/2016 02:08:00,2065539023,1
01/01/2016 19:04:00,2065539022,20
01/01/2016 11:05:00,2065539023,25
01/01/2016 14:03:00,2065539030,6
'''
### DATE PARSING ###
# the CSV column is named 'date_time'; skipinitialspace=True drops any blank after a comma
df1 = read_csv(StringIO(a), parse_dates=['date_time'], skipinitialspace=True)
df2 = read_csv(StringIO(b), parse_dates=['date_time'], skipinitialspace=True)
### NOT SURE WHAT THIS DOES ###
# (pd.to_datetime leaves the column unchanged if parse_dates already converted it)
df1['date_time'] = pd.to_datetime(df1['date_time'])
df2['date_time'] = pd.to_datetime(df2['date_time'])
### SORT DATES IN ASCENDING ORDER ###
df1 = df1.sort_values('date_time', ascending=True)
df2 = df2.sort_values('date_time', ascending=True)
# converting this to the index so we can preserve the date_time
# columns so you can validate the merging logic
df1.index = df1['date_time']
df2.index = df2['date_time']
# the magic happens below, check the direction and tolerance arguments
tol = pd.Timedelta('10 minute')
df3 = pd.merge_asof(
    left=df1,
    right=df2,
    left_index=True,
    right_index=True,
    direction='nearest',
    tolerance=tol)
### PRINT RESULTS ###
print(df3)
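I want the phone number fields to match exactly. Rows should then match only if their date_time values are less than 10 minutes apart (priority 1) and their durations are less than 10 seconds apart (priority 2). Desired output: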
phone_x, phone_y, date_time_x, date_time_y, duration_x, duration_y
2065539023, 2065539023, 01/01/2016 02:03:00, 01/01/2016 02:08:00, 1, 1
2065539023, NaN, 01/01/2016 11:01:00, NaN, 21, NaN
2065539023, 2065539023, 01/01/2016 11:02:00, 01/01/2016 11:05:00, 27, 25
2065539030, 2065539030, 01/01/2016 14:02:00, 01/01/2016 14:03:00, 5, 6
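For reference, here is a minimal sketch of one way the rule above could be expressed without merge_asof. It rests on my own assumptions: `match_once` is a hypothetical helper (not a pandas function), "priority" is read as "sort candidate pairs by time gap first, then by duration gap", and pairs are picked greedily, so it is not guaranteed to find the best overall assignment. It builds every same-phone pair, so it is only practical for small inputs.

```python
from io import StringIO

import pandas as pd

a = """date_time,phone,duration
01/01/2016 02:03:00,2065539023,1
01/01/2016 11:01:00,2065539023,21
01/01/2016 11:02:00,2065539023,27
01/01/2016 14:02:00,2065539030,5
"""
b = """date_time,phone,duration
01/01/2016 02:08:00,2065539023,1
01/01/2016 19:04:00,2065539022,20
01/01/2016 11:05:00,2065539023,25
01/01/2016 14:03:00,2065539030,6
"""
df1 = pd.read_csv(StringIO(a), parse_dates=["date_time"])
df2 = pd.read_csv(StringIO(b), parse_dates=["date_time"])


def match_once(left, right, time_tol=pd.Timedelta("10min"), dur_tol=10):
    """Hypothetical helper: left join that uses each right row at most once."""
    l = left.reset_index(drop=True).add_suffix("_x")
    r = right.reset_index(drop=True).add_suffix("_y")
    l["left_id"] = l.index
    r["right_id"] = r.index

    # Every left/right pair with exactly the same phone number.
    pairs = l.merge(r, left_on="phone_x", right_on="phone_y")
    pairs["time_gap"] = (pairs["date_time_x"] - pairs["date_time_y"]).abs()
    pairs["dur_gap"] = (pairs["duration_x"] - pairs["duration_y"]).abs()

    # Keep only pairs inside both tolerances.
    pairs = pairs[(pairs["time_gap"] < time_tol) & (pairs["dur_gap"] < dur_tol)]

    # Assumed priority: smallest time gap first, duration gap breaks ties.
    pairs = pairs.sort_values(["time_gap", "dur_gap"])

    # Greedy pass: accept a pair only if neither row has been used yet.
    used_left, used_right, keep = set(), set(), []
    for idx, lid, rid in zip(pairs.index, pairs["left_id"], pairs["right_id"]):
        if lid not in used_left and rid not in used_right:
            used_left.add(lid)
            used_right.add(rid)
            keep.append(idx)

    # Left rows without a partner stay in the result with empty *_y columns.
    unmatched = l[~l["left_id"].isin(list(used_left))]
    out = pd.concat([pairs.loc[keep], unmatched]).sort_values("left_id")
    helper_cols = ["left_id", "right_id", "time_gap", "dur_gap"]
    return out.drop(columns=helper_cols).reset_index(drop=True)


out = match_once(df1, df2)
print(out)
```

On the sample data this pairs the 11:02 row with the 11:05 row (3 minutes apart) and leaves the 11:01 row unmatched, because the only right row within tolerance has already been used.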