Pandas: approximate join on one column, exact match on other columns

Time: 2016-05-31 11:37:56

Tags: python pandas merge nearest-neighbor exact-match

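I have two pandas dataframes that I want to join/merge exactly on several columns (say 3) and approximately (i.e. by nearest neighbour) on one (date) column. I also want to return the difference between the dates, in days. Each dataset has about 50,000 rows. I am most interested in an inner join, but the "leftovers" are also interesting if they are not too hard to get hold of. Most of the "exact match" observations will exist multiple times in each dataframe.

I have been trying to use difflib.get_close_matches on the columns joined together as a string (silly, I know!), but that does not always give exact matches. I guess I need to loop over the exact matches first and then find the closest match within each group, but I can't seem to get it right...

The dataframes look like: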

df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')

df1
Out[430]: 
       col1   col2 col3        date
index                              
a1     1232    asd    1  2010-01-23
a2      432  dsa12    2  2016-05-20
a3      432  dsa12    2  2010-06-20
a4      123   asd2    3  2008-10-21

df2 = pd.DataFrame({'index': ['b1','b2','b3','b4'], 'col1': ['132','432','432','123'], 'col2': ['asd','dsa12','dsa12','sd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-23','2010-06-10','2008-10-21'],}).set_index('index')

df2
Out[434]: 
      col1   col2 col3        date
index                             
b1     132    asd    1  2010-01-23
b2     432  dsa12    2  2016-05-23
b3     432  dsa12    2  2010-06-10
b4     123    sd2    3  2008-10-21

What I would like to end up with is:

       col1   col2 col3        date diff match_index
index                              
a1     1232    asd    1  2010-01-23  nan         nan
a2      432  dsa12    2  2016-05-20   -3          b2
a3      432  dsa12    2  2010-06-20   10          b3
a4      123   asd2    3  2008-10-21  nan         nan
a5      123    sd2    3  2008-10-21  nan          b4

Or, if it is easier with just an inner join, I would like:

       col1   col2 col3        date diff match_index
index                                                     
a2      432  dsa12    2  2016-05-20   -3          b2
a3      432  dsa12    2  2010-06-20   10          b3

1 Answer:

Answer 0 (score: 2)

Hej mate,

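I'm not sure whether this is appropriate. It achieves more or less what you want, but it does not actually perform a merge. It follows the same idea as this question, except that instead of subsetting df1 on a single column, here we use groupby to match on multiple columns, and we do so on both dataframes. If you really want to include an explicit merge command and are happy with an inner join, check the very bottom of the answer, which contains a snippet.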

import pandas as pd
from sklearn.neighbors import NearestNeighbors

def find_nearest(group, df2, groupname):
    try:
        # rows of df2 whose key columns match this group exactly
        match = df2.groupby(groupname).get_group(group.name).copy()
        match['date'] = pd.to_datetime(match.date)
        # nearest-neighbour search on the dates, cast to integer nanoseconds
        nbrs = NearestNeighbors(n_neighbors=1).fit(match['date'].values.astype('int64')[:, None])
        dist, ind = nbrs.kneighbors(group['date'].values.astype('int64')[:, None])

        group['date1'] = group['date']
        group['date'] = match['date'].values[ind.ravel()]
        group['diff'] = (group['date1']-group['date'])
        group['match_index'] = match.index[ind.ravel()]
        return group
    except KeyError:
        # no rows in df2 share this key combination
        return group

#change dates from string to datetime
df1['date'] = pd.to_datetime(df1.date)
df2['date'] = pd.to_datetime(df2.date)

#find closest dates and differences
keys = ['col1', 'col2', 'col3']
df1_mod = df1.groupby(keys).apply(find_nearest, df2, keys)

#fill unmatched dates 
df1_mod['date1'] = df1_mod['date1'].fillna(df1_mod['date'])

df2_mod = df2.groupby(keys).apply(find_nearest, df1, keys) 
df2_mod['date1'] = df2_mod['date1'].fillna(df2_mod['date'])

#drop original column 
df1_mod.drop('date', inplace=True, axis=1)
df1_mod.rename(columns = {'date1':'date'}, inplace=True)

df2_mod.drop('date', inplace=True, axis=1)
df2_mod.rename(columns = {'date1':'date'}, inplace=True)
df2_mod['diff'] = -df2_mod['diff']

#drop redundant values
df2_mod.drop(df2_mod[df2_mod.match_index.str.len()>0].index, inplace=True)

#merge the two 
df_final = pd.merge(df1_mod, df2_mod, how='outer')

This produces the following result:

In [349]: df_final
Out[349]:
   col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1   432  dsa12    2 2016-05-20 -3 days          b2
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN
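
To make the nearest-date step easier to check by hand, here is a minimal sketch of what happens inside a single exact-match group, written with plain pandas instead of scikit-learn. It is not the code used above; the names group, match, rows and result are only illustrative, and the dates are the ones from the example group with col1='432', col2='dsa12', col3='2'.

```python
import pandas as pd

# One exact-match group from the example data (col1='432', col2='dsa12', col3='2').
# 'group' holds the df1 dates, 'match' the candidate df2 dates for the same keys.
group = pd.Series(pd.to_datetime(['2016-05-20', '2010-06-20']), index=['a2', 'a3'])
match = pd.Series(pd.to_datetime(['2016-05-23', '2010-06-10']), index=['b2', 'b3'])

rows = []
for idx, date in group.items():
    gaps = (match - date).abs()   # absolute distance to every candidate date
    nearest = gaps.idxmin()       # index label of the closest candidate
    rows.append({
        'index': idx,
        'match_index': nearest,
        'diff': (date - match[nearest]).days,  # signed difference in days
    })

result = pd.DataFrame(rows).set_index('index')
print(result)
```

A loop like this is fine for checking a handful of groups, but it would probably be slow on 50,000 rows, which is one reason to use a vectorised nearest-neighbour search instead.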

Using the merge command:

In [208]: pd.merge(df1_mod, df2.drop('date', axis=1), on=['col1', 'col2', 'col3']).drop_duplicates()
Out[208]:
  col1   col2 col3       date    diff match_index
0  432  dsa12    2 2016-05-20 -3 days          b2
2  432  dsa12    2 2010-06-20 10 days          b3
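
As a possible alternative to the approach above, and not part of the original answer, the inner join can also be sketched with pd.merge_asof, assuming a pandas version that provides it together with the direction parameter. It matches exactly on the by columns and approximately on the date column. Both frames have to be sorted by their date column first. The names left, right, match_date, joined and inner are only illustrative.

```python
import pandas as pd

keys = ['col1', 'col2', 'col3']

df1 = pd.DataFrame({
    'index': ['a1', 'a2', 'a3', 'a4'],
    'col1': ['1232', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'asd2'],
    'col3': ['1', '2', '2', '3'],
    'date': ['2010-01-23', '2016-05-20', '2010-06-20', '2008-10-21'],
}).set_index('index')

df2 = pd.DataFrame({
    'index': ['b1', 'b2', 'b3', 'b4'],
    'col1': ['132', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'sd2'],
    'col3': ['1', '2', '2', '3'],
    'date': ['2010-01-23', '2016-05-23', '2010-06-10', '2008-10-21'],
}).set_index('index')

# merge_asof needs real datetimes and both frames sorted by the approximate key
left = df1.reset_index()
left['date'] = pd.to_datetime(left['date'])
left = left.sort_values('date')

right = df2.reset_index().rename(columns={'index': 'match_index', 'date': 'match_date'})
right['match_date'] = pd.to_datetime(right['match_date'])
right = right.sort_values('match_date')

# exact match on the key columns, nearest match on the dates
joined = pd.merge_asof(left, right, left_on='date', right_on='match_date',
                       by=keys, direction='nearest')
joined['diff'] = (joined['date'] - joined['match_date']).dt.days

# rows without any exact key match have no match_index; drop them for the inner join
inner = joined.dropna(subset=['match_index']).set_index('index').sort_index()
print(inner[keys + ['date', 'diff', 'match_index']])
```

Keeping joined instead of inner gives the left join, with NaN in diff and match_index for the rows of df1 that have no exact key match in df2.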

The case considered in the comments, i.e.:

df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','1432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')

produces the following result:

In [351]: df_final
Out[351]:
   col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1  1432  dsa12    2 2016-05-20     NaT         NaN
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN
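
For the "leftovers" mentioned in the question, one hedged option (again not part of the original answer) is to compare only the key columns with an outer merge and indicator=True, which labels each key combination as present in the left frame only, the right frame only, or both. The sketch below uses the original example data; k1, k2, status, only_df1 and only_df2 are illustrative names.

```python
import pandas as pd

keys = ['col1', 'col2', 'col3']

df1 = pd.DataFrame({
    'index': ['a1', 'a2', 'a3', 'a4'],
    'col1': ['1232', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'asd2'],
    'col3': ['1', '2', '2', '3'],
}).set_index('index')

df2 = pd.DataFrame({
    'index': ['b1', 'b2', 'b3', 'b4'],
    'col1': ['132', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'sd2'],
    'col3': ['1', '2', '2', '3'],
}).set_index('index')

# unique key combinations on each side
k1 = df1[keys].drop_duplicates()
k2 = df2[keys].drop_duplicates()

# the '_merge' column says where each key combination was found
status = k1.merge(k2, on=keys, how='outer', indicator=True)

only_df1 = status[status['_merge'] == 'left_only']    # keys with no exact match in df2
only_df2 = status[status['_merge'] == 'right_only']   # keys with no exact match in df1
print(only_df1[keys])
print(only_df2[keys])
```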