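I have two pandas DataFrames that I want to join/merge exactly on several columns (say 3) and approximately (i.e. nearest neighbour) on one (date) column. I also want to return the difference between the matched dates, in days. Each dataset has about 50,000 rows. I'm mostly interested in an inner join, but the "leftovers" are also interesting if they are not too hard to get hold of. Most of the "exact match" observations will occur multiple times in each DataFrame.
I've been trying to use difflib.get_close_matches on the join columns concatenated into a single string (which is silly, I know!), but it doesn't always give exact matches. I think I need to loop over the exact matches first and then find the closest match within each group, but I can't seem to get it right...
The DataFrames look like: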
df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')
df1
Out[430]:
       col1   col2 col3        date
index
a1     1232    asd    1  2010-01-23
a2      432  dsa12    2  2016-05-20
a3      432  dsa12    2  2010-06-20
a4      123   asd2    3  2008-10-21
df2 = pd.DataFrame({'index': ['b1','b2','b3','b4'], 'col1': ['132','432','432','123'], 'col2': ['asd','dsa12','dsa12','sd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-23','2010-06-10','2008-10-21'],}).set_index('index')
df2
Out[434]:
       col1   col2 col3        date
index
b1      132    asd    1  2010-01-23
b2      432  dsa12    2  2016-05-23
b3      432  dsa12    2  2010-06-10
b4      123    sd2    3  2008-10-21
What I want in the end is:
       col1   col2 col3        date  diff match_index
index
a1     1232    asd    1  2010-01-23   nan         nan
a2      432  dsa12    2  2016-05-20    -3          b2
a3      432  dsa12    2  2010-06-20    10          b3
a4      123   asd2    3  2008-10-21   nan         nan
a5      123    sd2    3  2008-10-21   nan          b4
Or, if it is easier, just the inner join, which would look like:
       col1   col2 col3        date  diff match_index
index
a2      432  dsa12    2  2016-05-20    -3          b2
a3      432  dsa12    2  2010-06-20    10          b3
Answer 0 (score: 2)
Hej mate,
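I'm not sure this is exactly what you're after. It does more or less what you want, but it does not actually perform a merge. It follows the same idea as this question, except that instead of subsetting df1 based on a single column, here we use groupby to match on multiple columns, and we do so on both DataFrames. If you do want to include the merge command explicitly and are happy with an inner join, check the very bottom of the answer, which contains a code snippet.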
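To make the idea concrete for a single group, here is a minimal sketch of the "closest date within an exact-match group" step. It uses plain NumPy instead of scikit-learn, and the toy series `left_dates` and `right_dates` are illustrative assumptions (they mirror the rows of the group col1='432', col2='dsa12', col3='2' from the question), not part of the answer's code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the dates of one key group from each DataFrame.
left_dates = pd.to_datetime(pd.Series(['2016-05-20', '2010-06-20'], index=['a2', 'a3']))
right_dates = pd.to_datetime(pd.Series(['2016-05-23', '2010-06-10'], index=['b2', 'b3']))

# Work on integer timestamps so the arithmetic is plain integer math.
left_int = left_dates.values.astype('int64')
right_int = right_dates.values.astype('int64')

# Pairwise absolute gaps: rows are left dates, columns are right dates.
gaps = np.abs(left_int[:, None] - right_int[None, :])

# Position of the closest right-hand date for every left-hand date.
nearest_pos = gaps.argmin(axis=1)

result = pd.DataFrame({
    'match_index': right_dates.index[nearest_pos].to_numpy(),
    'diff': left_dates.values - right_dates.values[nearest_pos],
}, index=left_dates.index)
print(result)
```

The pairwise matrix is quadratic in the group size, so this is only meant to illustrate the logic on small groups; a nearest-neighbour structure such as the one used in the code that follows scales better.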
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def find_nearest(group, df2, groupname):
    try:
        # rows of df2 that share this group's key values
        match = df2.groupby(groupname).get_group(group.name).copy()
        match['date'] = pd.to_datetime(match['date'])
        # scikit-learn expects numeric input, so fit on integer timestamps
        nbrs = NearestNeighbors(n_neighbors=1).fit(match['date'].values.astype('int64')[:, None])
        dist, ind = nbrs.kneighbors(group['date'].values.astype('int64')[:, None])
        group = group.copy()
        group['date1'] = group['date']
        group['date'] = match['date'].values[ind.ravel()]
        group['diff'] = group['date1'] - group['date']
        group['match_index'] = match.index[ind.ravel()]
        return group
    except KeyError:
        # no rows in df2 with these key values: return the group unchanged
        return group
# change dates from string to datetime
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
# find closest dates and differences
keys = ['col1', 'col2', 'col3']
# group_keys=False avoids the group keys being added as extra index levels
df1_mod = df1.groupby(keys, group_keys=False).apply(find_nearest, df2, keys)
# fill unmatched dates
df1_mod['date1'] = df1_mod['date1'].fillna(df1_mod['date'])
df2_mod = df2.groupby(keys, group_keys=False).apply(find_nearest, df1, keys)
df2_mod['date1'] = df2_mod['date1'].fillna(df2_mod['date'])
# drop original column
df1_mod.drop('date', inplace=True, axis=1)
df1_mod.rename(columns={'date1': 'date'}, inplace=True)
df2_mod.drop('date', inplace=True, axis=1)
df2_mod.rename(columns={'date1': 'date'}, inplace=True)
df2_mod['diff'] = -df2_mod['diff']
# drop redundant values (rows of df2 that were already matched from df1)
df2_mod.drop(df2_mod[df2_mod.match_index.str.len() > 0].index, inplace=True)
# merge the two
df_final = pd.merge(df1_mod, df2_mod, how='outer')
This produces the following result:
In [349]: df_final
Out[349]:
   col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1   432  dsa12    2 2016-05-20 -3 days          b2
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN
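The diff column above holds timedeltas ("-3 days"), whereas the question asked for a number of days. A small sketch of the conversion, using a made-up series `diff` that stands in for `df_final['diff']`:

```python
import pandas as pd

# Hypothetical stand-in for df_final['diff']: one unmatched row and two matches.
diff = pd.Series(pd.to_timedelta([None, '-3 days', '10 days']))

# .dt.days extracts whole days; unmatched rows (NaT) become NaN.
days = diff.dt.days
print(days)
```

Because of the missing values the result is a float column; it would need to be filled or cast to a nullable integer type if integer days are required.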
Using the merge command:
In [208]: pd.merge(df1_mod, df2.drop('date', axis=1), on=['col1', 'col2', 'col3']).drop_duplicates()
Out[208]:
  col1   col2 col3       date    diff match_index
0  432  dsa12    2 2016-05-20 -3 days          b2
2  432  dsa12    2 2010-06-20 10 days          b3
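As an alternative sketch for the inner join, pandas also provides merge_asof, which matches exactly on the `by` columns and approximately on the `on` column. This is a different technique from the groupby approach above, and it assumes a pandas version in which merge_asof accepts `direction='nearest'`. The names `left`, `right`, `match_date` and `inner` are illustrative; the data is copied from the question.

```python
import pandas as pd

keys = ['col1', 'col2', 'col3']
left = pd.DataFrame({
    'col1': ['1232', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'asd2'],
    'col3': ['1', '2', '2', '3'],
    'date': pd.to_datetime(['2010-01-23', '2016-05-20', '2010-06-20', '2008-10-21']),
}, index=['a1', 'a2', 'a3', 'a4'])
right = pd.DataFrame({
    'col1': ['132', '432', '432', '123'],
    'col2': ['asd', 'dsa12', 'dsa12', 'sd2'],
    'col3': ['1', '2', '2', '3'],
    'date': pd.to_datetime(['2010-01-23', '2016-05-23', '2010-06-10', '2008-10-21']),
}, index=['b1', 'b2', 'b3', 'b4'])

# merge_asof needs both frames sorted by the "on" column and does not keep
# the index, so the labels and the right-hand date become ordinary columns.
left_sorted = left.rename_axis('index').reset_index().sort_values('date')
right_sorted = (right.rename_axis('match_index').reset_index()
                .assign(match_date=lambda d: d['date'])
                .sort_values('date'))

merged = pd.merge_asof(left_sorted, right_sorted, on='date', by=keys,
                       direction='nearest')

# Rows without a partner in the same key group have a missing match_index.
inner = merged.dropna(subset=['match_index']).copy()
inner['diff'] = (inner['date'] - inner['match_date']).dt.days
inner = inner.set_index('index').sort_index().drop(columns='match_date')
print(inner)
```

Note that merge_asof is a left join: each left row gets at most one partner, and several left rows may share the same right row. The rows dropped by dropna are the left-hand "leftovers"; the right-hand ones would need a second pass with the roles swapped.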
The case considered in the comments, i.e.:
df1 = pd.DataFrame({'index': ['a1','a2','a3','a4'], 'col1': ['1232','1432','432','123'], 'col2': ['asd','dsa12','dsa12','asd2'], 'col3': ['1','2','2','3'], 'date': ['2010-01-23','2016-05-20','2010-06-20','2008-10-21'],}).set_index('index')
produces the following result:
In [351]: df_final
Out[351]:
   col1   col2 col3       date    diff match_index
0  1232    asd    1 2010-01-23     NaT         NaN
1  1432  dsa12    2 2016-05-20     NaT         NaN
2   432  dsa12    2 2010-06-20 10 days          b3
3   123   asd2    3 2008-10-21     NaT         NaN
4   132    asd    1 2010-01-23     NaT         NaN
5   123    sd2    3 2008-10-21     NaT         NaN