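I want to merge two DataFrames. The problem is that the key columns I am joining on do not contain exactly the same values. For example, this is what df1 looks like: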
name                              val3
Wilder Deontay                    1
Fury Tyson                        2
Ortiz Luis                        3
Joshua Olaseni Oluwafemi Anthony  4
and this is df2:
name1           val
Deontay Wilder  19
Tyson Fury      20
Luis Ortiz      21
Anthony Joshua  10
The expected output is the merge of the two DataFrames, i.e.:
name1           val  val3
Deontay Wilder  19   1
Tyson Fury      20   2
Luis Ortiz      21   3
Anthony Joshua  10   4
Answer 0 (score: 0)
Here is my solution:
>>> import pandas as pd
>>> from fuzzywuzzy import fuzz
>>> data = {
...     'name': ['Wilder Deontay', 'Fury Tyson', 'Ortiz Luis', 'Joshua Olaseni Oluwafemi Anthony'],
...     'val3': [1, 2, 3, 4]
... }
>>> df1 = pd.DataFrame(data)
>>> data2 = {
...     'name1': ['Deontay Wilder', 'Tyson Fury', 'Luis Ortiz ', 'Anthony Joshua'],
...     'val': [19, 20, 21, 10]
... }
>>> df2 = pd.DataFrame(data2)
>>> df1['key'] = 1
>>> df2['key'] = 1
>>> merged = df1.merge(df2, on='key')
>>> merged['similarity'] = merged.apply(lambda row: fuzz.token_set_ratio(row['name'], row['name1']), axis=1)
>>> merged[merged.similarity == 100][['name1', 'val', 'val3']]
             name1  val  val3
0   Deontay Wilder   19     1
5       Tyson Fury   20     2
10     Luis Ortiz    21     3
15  Anthony Joshua   10     4
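First I do a cross merge (every row of df1 paired with every row of df2 through the constant key column), then I keep only the pairs whose similarity is 100. For details on fuzzywuzzy and token_set_ratio, see: https://stackoverflow.com/a/31823872/8205554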
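To make the idea concrete, here is a minimal standard-library sketch of the same "pair everything, then filter by a token-set score" approach. It is not fuzzywuzzy's implementation: `token_set_score` and `_ratio` are hypothetical helpers written for illustration, and `difflib.SequenceMatcher` stands in for the Levenshtein-based ratio, so scores below 100 may differ from what fuzzywuzzy reports. What it does show is why a name whose tokens are a subset of the other name's tokens scores 100 regardless of word order.

```python
from difflib import SequenceMatcher
from itertools import product


def _ratio(a: str, b: str) -> int:
    """Similarity of two strings on a 0-100 scale (difflib stand-in, an assumption)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def token_set_score(s1: str, s2: str) -> int:
    """Simplified token-set comparison: ignores word order and extra tokens."""
    tokens1 = set(s1.lower().split())
    tokens2 = set(s2.lower().split())
    # Shared tokens, and the tokens unique to each side, each in sorted order.
    common = " ".join(sorted(tokens1 & tokens2))
    rest1 = " ".join(sorted(tokens1 - tokens2))
    rest2 = " ".join(sorted(tokens2 - tokens1))
    combined1 = f"{common} {rest1}".strip()
    combined2 = f"{common} {rest2}".strip()
    # If one token set contains the other, `common` equals one of the
    # combined strings, so the best of the three comparisons is 100.
    return max(
        _ratio(common, combined1),
        _ratio(common, combined2),
        _ratio(combined1, combined2),
    )


# Same data as df1 and df2, as plain (name, value) tuples.
left = [
    ("Wilder Deontay", 1),
    ("Fury Tyson", 2),
    ("Ortiz Luis", 3),
    ("Joshua Olaseni Oluwafemi Anthony", 4),
]
right = [
    ("Deontay Wilder", 19),
    ("Tyson Fury", 20),
    ("Luis Ortiz ", 21),
    ("Anthony Joshua", 10),
]

# Cross join (every left row with every right row), then keep perfect scores.
matches = [
    (name1.strip(), val, val3)
    for (name, val3), (name1, val) in product(left, right)
    if token_set_score(name, name1) == 100
]

for row in matches:
    print(row)
```

Keep in mind that the cross join produces len(df1) * len(df2) pairs, so this approach is only practical for fairly small frames, and that requiring a score of exactly 100 drops names that contain a typo.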
Alternatively, you can use fuzzymatcher:
>>> from fuzzymatcher import fuzzy_left_join
>>> fuzzy_left_join(df1, df2, 'name', 'name1')[['name1', 'val', 'val3']]
            name1  val  val3
0  Deontay Wilder   19     1
1      Tyson Fury   20     2
2     Luis Ortiz    21     3
3  Anthony Joshua   10     4