Merging values from two dataframes based on tolerances for two conditions?

Time: 2019-02-02 21:43:25

Tags: python pandas dataframe merge

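I am trying to match rows in two dataframes when two criteria are each met within a tolerance. Can this be done with pandas merge_asof? And is it possible to do a left merge without using any row from the right dataframe more than once?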

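I have been able to match rows whose date_time values differ by less than 10 minutes, by indexing on the date_time field and sorting it. That works so far, but I don't know how to add a second criterion with its own tolerance. I also haven't figured out how to avoid merging the same right dataframe row more than once.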

Any help is appreciated.

import pandas as pd
from pandas import read_csv
from io import StringIO


### DATA FRAME INPUTS ###
a = '''
date_time,phone,duration
01/01/2016 02:03:00,2065539023,1
01/01/2016 11:01:00,2065539023,21
01/01/2016 11:02:00,2065539023,27
01/01/2016 14:02:00,2065539030,5
'''

b = '''
date_time,phone,duration
01/01/2016 02:08:00,2065539023,1
01/01/2016 19:04:00,2065539022,20
01/01/2016 11:05:00,2065539023,25
01/01/2016 14:03:00,2065539030,6
'''

### DATE PARSING ###
# the CSV column is called date_time, so that is the name to parse
df1 = read_csv(StringIO(a), parse_dates=['date_time'])
df2 = read_csv(StringIO(b), parse_dates=['date_time'])

### NOT SURE WHAT THIS DOES (looks redundant given parse_dates above) ###
df1['date_time'] = pd.to_datetime(df1['date_time'])
df2['date_time'] = pd.to_datetime(df2['date_time'])

### SORT DATES IN ASCENDING ORDER ###
df1 = df1.sort_values('date_time', ascending=True)
df2 = df2.sort_values('date_time', ascending=True)

# converting this to the index so we can preserve the date_time columns,
# so you can validate the merging logic
df1.index = df1['date_time']
df2.index = df2['date_time']

# the magic happens below, check the direction and tolerance arguments
tol = pd.Timedelta('10 minutes')
df3 = pd.merge_asof(
    left=df1,
    right=df2,
    left_index=True,
    right_index=True,
    direction='nearest',
    tolerance=tol)

### PRINT RESULTS ###
print(df3)

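I want the phone number field to match exactly. Rows should match when the date_time values are less than 10 minutes apart (priority 1) and the durations are less than 10 seconds apart (priority 2). This is the result I am after: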

phone_x, phone_y, date_time_x, date_time_y, duration_x, duration_y
2065539023, 2065539023, 01/01/2016 02:03:00, 01/01/2016 02:08:00, 1, 1
2065539023, NaN, 01/01/2016 11:01:00, NaN, 21, NaN
2065539023, 2065539023, 01/01/2016 11:02:00, 01/01/2016 11:05:00, 27, 25
2065539030, 2065539030, 01/01/2016 14:02:00, 01/01/2016 14:03:00, 5, 6
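
For reference, here is a rough sketch of one way the criteria above could be expressed. It does not use merge_asof: merge_asof takes a single tolerance on the ordered key (its by argument can handle the exact phone match), and it may pair the same right row with several left rows. The sketch instead builds every same-phone candidate pair, filters on both tolerances, and then greedily keeps the closest pairs so that each row is used at most once. The names match_once, left_id, right_id, time_gap and dur_gap are made up for illustration, and it assumes duration is stored in seconds. The greedy pass is a heuristic and is not guaranteed to find the largest possible set of matches.

```python
from io import StringIO

import pandas as pd

left_csv = """date_time,phone,duration
01/01/2016 02:03:00,2065539023,1
01/01/2016 11:01:00,2065539023,21
01/01/2016 11:02:00,2065539023,27
01/01/2016 14:02:00,2065539030,5
"""
right_csv = """date_time,phone,duration
01/01/2016 02:08:00,2065539023,1
01/01/2016 19:04:00,2065539022,20
01/01/2016 11:05:00,2065539023,25
01/01/2016 14:03:00,2065539030,6
"""


def match_once(left, right, time_tol=pd.Timedelta("10min"), dur_tol=10):
    """Left-join right onto left, using each right row at most once.

    Hypothetical helper: dur_tol is in the same unit as the duration
    column (assumed here to be seconds).
    """
    # Suffix the columns up front so both phone columns survive the merge.
    l = left.reset_index(drop=True).add_suffix("_x")
    r = right.reset_index(drop=True).add_suffix("_y")
    l["left_id"] = range(len(l))
    r["right_id"] = range(len(r))

    # Every candidate pair that shares exactly the same phone number.
    pairs = l.merge(r, left_on="phone_x", right_on="phone_y")
    pairs["time_gap"] = (pairs["date_time_x"] - pairs["date_time_y"]).abs()
    pairs["dur_gap"] = (pairs["duration_x"] - pairs["duration_y"]).abs()

    # Both tolerances must hold (strictly less than, as in the question).
    pairs = pairs[(pairs["time_gap"] < time_tol) & (pairs["dur_gap"] < dur_tol)]

    # Greedy pass: closest in time first, duration gap breaks ties.
    pairs = pairs.sort_values(["time_gap", "dur_gap"])
    used_left, used_right, keep = set(), set(), []
    for idx, lid, rid in zip(pairs.index, pairs["left_id"], pairs["right_id"]):
        if lid not in used_left and rid not in used_right:
            used_left.add(lid)
            used_right.add(rid)
            keep.append(idx)

    # Left merge back so unmatched left rows stay, with NaN/NaT on the right.
    right_cols = ["left_id"] + [c for c in r.columns if c != "right_id"]
    chosen = pairs.loc[keep, right_cols]
    out = l.merge(chosen, on="left_id", how="left")
    return out.drop(columns="left_id")


df1 = pd.read_csv(StringIO(left_csv), parse_dates=["date_time"])
df2 = pd.read_csv(StringIO(right_csv), parse_dates=["date_time"])
result = match_once(df1, df2)
print(result)
```

On the sample data, the 11:01 and 11:02 rows both qualify for the 11:05 row on the right. The 11:02 row is closer in time, so it gets the match and the 11:01 row is left with missing values, which is what the table above asks for.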

0 answers:

No answers yet