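I want to merge two DataFrames on a column. However, because of alternate spellings, differing numbers of spaces, and the absence or presence of diacritical marks, I would like to be able to merge as long as the values are similar to one another.
Any similarity algorithm will do (soundex, Levenshtein, difflib).
Say one DataFrame has the following data: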
df1 = DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
number
one 1
two 2
three 3
four 4
five 5
df2 = DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
letter
one a
too b
three c
fours d
five e
Then I want to get the resulting DataFrame:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
Answer 0 (score: 60)
Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then apply a join:
In [23]: import difflib
In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>
In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
In [26]: df2
Out[26]:
letter
one a
two b
three c
four d
five e
In [31]: df1.join(df2)
Out[31]:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
If these were columns, you could apply the same approach to the column and then merge:
df1 = DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])
df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)
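Since the question also mentions stray whitespace and diacritics, it may help to normalize the keys before calling get_close_matches. The sketch below uses only the standard library. normalize_key and closest_key are hypothetical helper names (not part of pandas or difflib), and lowercasing plus accent stripping are assumptions about what should count as "the same" in your data.

```python
import difflib
import unicodedata

def normalize_key(text):
    """Lowercase, strip diacritics, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.lower().split())

def closest_key(text, candidates, cutoff=0.6):
    """Return the candidate closest to text after normalization, or None."""
    # Map normalized form -> original spelling so the original can be returned.
    lookup = {normalize_key(c): c for c in candidates}
    matches = difflib.get_close_matches(
        normalize_key(text), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[matches[0]] if matches else None

keys = ["one", "two", "three", "four", "five"]
print(closest_key("  Thrée ", keys))
print(closest_key("fours", keys))
print(closest_key("a very different string", keys))
```

Returning None instead of indexing [0] blindly also avoids an IndexError when nothing is close enough.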
Answer 1 (score: 10)
I have written a Python package which aims to solve this problem:
pip install fuzzymatcher
Basic usage:
Given two dataframes df_left and df_right that you want to fuzzy join, you can write the following:
from fuzzymatcher import link_table, fuzzy_left_join
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]
# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]
# The link table potentially contains several matches for each record
link_table(df_left, df_right, left_on, right_on)
Or, if you just want to link on the closest match:
fuzzy_left_join(df_left, df_right, left_on, right_on)
Answer 2 (score: 9)
I would use Jaro-Winkler, because it is one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler].
This is how I would do it with Jaro-Winkler from the jellyfish package:
import jellyfish
import pandas
def get_closest_match(x, list_strings):
    best_match = None
    highest_jw = 0
    for current_string in list_strings:
        # older jellyfish releases expose this function as jellyfish.jaro_winkler
        current_score = jellyfish.jaro_winkler_similarity(x, current_string)
        if current_score > highest_jw:
            highest_jw = current_score
            best_match = current_string
    return best_match
df1 = pandas.DataFrame([[1],[2],[3],[4],[5]], index=['one','two','three','four','five'], columns=['number'])
df2 = pandas.DataFrame([['a'],['b'],['c'],['d'],['e']], index=['one','too','three','fours','five'], columns=['letter'])
df2.index = df2.index.map(lambda x: get_closest_match(x, df1.index))
df1.join(df2)
Output:
number letter
one 1 a
two 2 b
three 3 c
four 4 d
five 5 e
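For anyone who wants to see what the score measures, here is a minimal pure-Python sketch of Jaro-Winkler following the textbook definition (prefix scale 0.1, common prefix capped at 4 characters). The function names are my own, this is not jellyfish's implementation, and edge cases may be handled differently there.

```python
def jaro_similarity(s1, s2):
    """Jaro similarity between two strings, in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler_similarity(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for strings that share a common prefix."""
    jaro = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1 - jaro)

candidates = ["one", "two", "three", "four", "five"]
best = max(candidates, key=lambda c: jaro_winkler_similarity("too", c))
print(best, round(jaro_winkler_similarity("too", best), 3))
```

The prefix boost is what distinguishes Jaro-Winkler from plain Jaro: strings that agree at the start score higher, which tends to suit names and short keys.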
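Answer 3 (score: 5)
http://pandas.pydata.org/pandas-docs/dev/merging.html does not have a hook function to do this on the fly. It would be nice though...
I would just do a separate step: use difflib's get_close_matches to create a new column in one of the two dataframes, then merge/join on the fuzzy-matched column.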
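A minimal sketch of that two-step approach, assuming the keys live in a column called name (the column names, the name_matched helper column, and the closest helper are made up for illustration). Keeping the match in a new column leaves the original spelling available for inspection after the merge.

```python
import difflib
import pandas as pd

df1 = pd.DataFrame({"name": ["one", "two", "three", "four", "five"],
                    "number": [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({"name": ["one", "too", "three", "fours", "five"],
                    "letter": ["a", "b", "c", "d", "e"]})

def closest(value, choices):
    """Best difflib match for value among choices, or None if nothing is close."""
    matches = difflib.get_close_matches(value, choices, n=1)
    return matches[0] if matches else None

# Step 1: add a column holding the fuzzy-matched key from df1.
choices = df1["name"].tolist()
df2["name_matched"] = df2["name"].map(lambda value: closest(value, choices))

# Step 2: ordinary merge on the matched column.
merged = df1.merge(df2, left_on="name", right_on="name_matched",
                   how="left", suffixes=("", "_right"))
print(merged[["name", "number", "name_right", "letter"]])
```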
Answer 4 (score: 3)
fuzzy_merge
For a more general scenario in which we want to merge columns from two dataframes that contain slightly different strings, the following function combines difflib.get_close_matches with merge to mimic the functionality of pandas' merge, but with fuzzy matching:
import difflib
def fuzzy_merge(df1, df2, left_on, right_on, how='inner', cutoff=0.6):
    df_other = df2.copy()
    df_other[left_on] = [get_closest_match(x, df1[left_on], cutoff)
                         for x in df_other[right_on]]
    return df1.merge(df_other, on=left_on, how=how)
def get_closest_match(x, other, cutoff):
    matches = difflib.get_close_matches(x, other, cutoff=cutoff)
    return matches[0] if matches else None
Here are some use cases with two sample dataframes:
print(df1)
     key  number
0    one       1
1    two       2
2  three       3
3   four       4
4   five       5
print(df2)
                 key_close letter
0                    three      c
1                      one      a
2                      too      b
3                    fours      d
4  a very different string      e
With the above example, we get:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close')
     key  number key_close letter
0    one       1       one      a
1    two       2       too      b
2  three       3     three      c
3   four       4     fours      d
We can do a left join with:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='left')
     key  number key_close letter
0    one       1       one      a
1    two       2       too      b
2  three       3     three      c
3   four       4     fours      d
4   five       5       NaN    NaN
For a right join, all non-matching keys from the left dataframe are set to None:
fuzzy_merge(df1, df2, left_on='key', right_on='key_close', how='right')
     key  number                key_close letter
0    one     1.0                      one      a
1    two     2.0                      too      b
2  three     3.0                    three      c
3   four     4.0                    fours      d
4   None     NaN  a very different string      e
Also note that difflib.get_close_matches returns an empty list if no item matches within the cutoff. In the shared example, if we change the last index in df2 to:
print(df2)
                        letter
one                          a
too                          b
three                        c
fours                        d
a very different string      e
we get an index out of range error:
df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])
IndexError: list index out of range
To solve this, the get_closest_match function above returns the closest match by indexing the list returned by difflib.get_close_matches only if it actually contains any matches.
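To get a feel for what the cutoff does, the short standard-library sketch below prints the best difflib ratio for each right-hand key next to what get_close_matches keeps at cutoff=0.6. The similarity helper is a hypothetical name, and calling SequenceMatcher directly may give scores that differ slightly from what get_close_matches computes internally, since the ratio can depend on argument order.

```python
import difflib

left_keys = ["one", "two", "three", "four", "five"]
right_keys = ["three", "one", "too", "fours", "a very different string"]

def similarity(a, b):
    """difflib ratio: 2 * matched characters / total characters in both strings."""
    return difflib.SequenceMatcher(None, a, b).ratio()

for key in right_keys:
    best = max(left_keys, key=lambda candidate: similarity(key, candidate))
    kept = difflib.get_close_matches(key, left_keys, n=1, cutoff=0.6)
    print(f"{key!r}: best={best!r} score={similarity(key, best):.2f} kept={kept}")
```

Raising the cutoff makes the merge stricter (fewer false matches, more unmatched rows), and lowering it does the opposite.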
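Answer 5 (score: 2)
As a heads up, this basically works, except when no match is found or when either column contains NaNs. Instead of applying get_close_matches directly, I found it easier to apply the following function. The choice of NaN replacements will depend a lot on your dataset.
import difflib
import numpy as np
import pandas as pd
def fuzzy_match(a, b):
    # replace NaNs with placeholders that should not match each other
    left = '1' if pd.isnull(a) else a
    right = b.fillna('2')
    out = difflib.get_close_matches(left, right)
    return out[0] if out else np.nan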
Answer 6 (score: 2)
There is a package called fuzzy_pandas that can use the levenshtein, jaro, metaphone and bilenko methods. Some good examples are listed on the project page.
import pandas as pd
import fuzzy_pandas as fpd
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
results = fpd.fuzzy_merge(df1, df2,
left_on='Key',
right_on='Key',
method='levenshtein',
threshold=0.6)
results.head()
Key Key
0 Apple Aple
1 Banana Bannanna
2 Orange Orag
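For intuition about the levenshtein method, here is a small pure-Python sketch of edit distance and a length-normalized similarity. The function names are my own, and dividing by the longer string's length is an assumption, so fuzzy_pandas may score pairs differently internally.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """1.0 for identical strings, approaching 0.0 as more edits are needed."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

left = ["Apple", "Banana", "Orange", "Strawberry"]
right = ["Aple", "Mango", "Orag", "Straw", "Bannanna", "Berry"]
pairs = [(l, r) for l in left for r in right if similarity(l, r) >= 0.6]
print(pairs)
```

Under this normalization "Strawberry" and "Straw" score 0.5, which is below the 0.6 threshold used above.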
Answer 7 (score: 1)
You can use d6tjoin for this:
import d6tjoin.top1
d6tjoin.top1.MergeTop1(df1.reset_index(),df2.reset_index(),
fuzzy_left_on=['index'],fuzzy_right_on=['index']).merge()['merged']
index number index_right letter
0 one 1 one a
1 two 2 too b
2 three 3 three c
3 four 4 fours d
4 five 5 five e
It has a variety of additional features; see the d6tjoin documentation for details.
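Answer 8 (score: 0)
fuzzywuzzy
Since there are no examples with the fuzzywuzzy package, here is a function I wrote that returns all matches above a threshold that you can set as the user:
Example dataframes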
df1 = pd.DataFrame({'Key':['Apple', 'Banana', 'Orange', 'Strawberry']})
df2 = pd.DataFrame({'Key':['Aple', 'Mango', 'Orag', 'Straw', 'Bannanna', 'Berry']})
# df1
Key
0 Apple
1 Banana
2 Orange
3 Strawberry
# df2
Key
0 Aple
1 Mango
2 Orag
3 Straw
4 Bannanna
5 Berry
Function for fuzzy matching
def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    '''
    df_1 is the left table to join
    df_2 is the right table to join
    key1 is the key column of the left table
    key2 is the key column of the right table
    threshold is how close the matches should be to return a match
    limit is the number of matches that will be returned, sorted high to low
    '''
    s = df_2[key2].tolist()
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    return df_1
Using our function on the dataframes: #1
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
fuzzy_merge(df1, df2, 'Key', 'Key', threshold=80)
Key matches
0 Apple Aple
1 Banana Bannanna
2 Orange Orag
3 Strawberry Straw, Berry
Using our function on the dataframes: #2
df1 = pd.DataFrame({'Col1':['Microsoft', 'Google', 'Amazon', 'IBM']})
df2 = pd.DataFrame({'Col2':['Mcrsoft', 'gogle', 'Amason', 'BIM']})
fuzzy_merge(df1, df2, 'Col1', 'Col2', 80)
Col1 matches
0 Microsoft Mcrsoft
1 Google gogle
2 Amazon Amason
3 IBM
Pip
pip install fuzzywuzzy
Anaconda
conda install -c conda-forge fuzzywuzzy
Answer 9 (score: 0)
I used the fuzzymatcher package and it worked well for me. See the project page for more details.
Use the following command to install:
pip install fuzzymatcher
Below is the sample code (already submitted by RobinL above):
from fuzzymatcher import link_table, fuzzy_left_join
# Columns to match on from df_left
left_on = ["fname", "mname", "lname", "dob"]
# Columns to match on from df_right
right_on = ["name", "middlename", "surname", "date"]
# The link table potentially contains several matches for each record
link_table(df_left, df_right, left_on, right_on)
Errors you may get:
Pros:
Cons:
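Answer 10 (score: 0)
For more complex use cases that match rows on many columns, you can use the recordlinkage package. recordlinkage provides all the tools to fuzzy match rows between pandas dataframes, which helps to deduplicate your data when merging. I have written a detailed article about the package.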
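Answer 11 (score: 0)
If the join axis is numeric, this can also be used to match indexes within a specified tolerance:
import numpy as np
import pandas as pd
def fuzzy_left_join(df1, df2, tol=None):
    index1 = df1.index.values
    index2 = df2.index.values
    # pairwise absolute differences: one row per df1 index, one column per df2 index
    diff = np.abs(index1.reshape((-1, 1)) - index2)
    mask_j = np.argmin(diff, axis=1)  # for each df1 row, position of the nearest df2 index
    mask_i = np.arange(mask_j.shape[0])
    df1_ = df1.iloc[mask_i]
    df2_ = df2.iloc[mask_j]
    if tol is not None:
        # keep only pairs whose indexes differ by at most tol
        mask = np.abs(df2_.index.values - df1_.index.values) <= tol
        df1_ = df1_.loc[mask]
        df2_ = df2_.loc[mask]
    df2_.index = df1_.index
    out = pd.concat([df1_, df2_], axis=1)
    return out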
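pandas also has a built-in for the nearest-key case: pd.merge_asof with direction='nearest' and a tolerance. A minimal sketch, assuming the join key is a sorted integer column rather than the index (the column names and values are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({"t": [10, 20, 50], "a": ["x", "y", "z"]})
right = pd.DataFrame({"t": [11, 24, 90], "b": [100, 200, 300]})

# Both frames must be sorted by the key. Each left row takes the nearest
# right row, but only if the keys differ by at most `tolerance`.
out = pd.merge_asof(left, right, on="t", direction="nearest", tolerance=5)
print(out)
```

Rows with no right-hand key inside the tolerance are kept with NaN in the right-hand columns, much like a left join.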
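Answer 12 (score: 0)
I have used fuzzywuzzy in a very minimal way, whilst matching the existing behaviour and keywords of merge in pandas.
Just specify your accepted threshold for matching (between 0 and 100), then try it out using the example data: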
from fuzzywuzzy import process
def fuzzy_merge(df, df2, on=None, left_on=None, right_on=None, how='inner', threshold=80):
    def fuzzy_apply(x, df, column, threshold=threshold):
        if not isinstance(x, str):
            return None
        # the result tuple may carry extra fields (e.g. the index label), hence *_
        match, score, *_ = process.extract(x, df[column], limit=1)[0]
        if score >= threshold:
            return match
        else:
            return None
    if on is not None:
        left_on = on
        right_on = on
    # create temp column as the best fuzzy match (or None!)
    # note: this adds a 'tmp' column to the caller's df2
    df2['tmp'] = df2[right_on].apply(
        fuzzy_apply,
        df=df,
        column=left_on,
        threshold=threshold
    )
    merged_df = df.merge(df2, how=how, left_on=left_on, right_on='tmp')
    del merged_df['tmp']
    return merged_df