Merging 2 dataframes when the key values differ slightly

Time: 2019-12-07 17:13:26

Tags: python pandas

I want to merge 2 dataframes, but the keys I am joining on do not contain exactly the same values. For example, this is what df1 looks like

name                                    val3       
Wilder Deontay                           1
Fury Tyson                               2
Ortiz Luis                               3
Joshua Olaseni Oluwafemi Anthony         4

and df2

name1                        val       
Deontay Wilder               19
Tyson Fury                   20  
Luis Ortiz                   21
Anthony Joshua               10

The expected output is the merge of the two dataframes, so

name1                      val          val3
Deontay Wilder             19             1
Tyson Fury                 20             2
Luis Ortiz                 21             3
Anthony Joshua             10             4

1 answer:

Answer 0: (score: 0)

Here is my solution:

>>> import pandas as pd
>>> from fuzzywuzzy import fuzz
>>> data = {
...     'name': ['Wilder Deontay', 'Fury Tyson', 'Ortiz Luis', 'Joshua Olaseni Oluwafemi Anthony'],
...     'val3': [1, 2, 3, 4]
... }
>>> df1 = pd.DataFrame(data)
>>> data2 = {
...     'name1': ['Deontay Wilder', 'Tyson Fury', 'Luis Ortiz ', 'Anthony Joshua'],
...     'val': [19, 20, 21, 10]
... }
>>> df2 = pd.DataFrame(data2)
>>> df1['key'] = 1
>>> df2['key'] = 1
>>> merged = df1.merge(df2, on='key')
>>> merged['similarity'] = merged.apply(lambda row: fuzz.token_set_ratio(row['name'], row['name1']), axis=1)
>>> merged[merged.similarity == 100][['name1', 'val', 'val3']]
             name1  val  val3
0   Deontay Wilder   19     1
5       Tyson Fury   20     2
10     Luis Ortiz    21     3
15  Anthony Joshua   10     4

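First I do a cross merge, so that every row of df1 is paired with every row of df2, and then I keep only the pairs whose similarity is 100. For details on fuzzywuzzy's token_set_ratio, see: https://stackoverflow.com/a/31823872/8205554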
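
To make the idea behind the cross merge and the token-set comparison concrete, here is a rough standard-library sketch. It is not fuzzywuzzy's actual implementation: `token_set_score` is a hypothetical helper that approximates the token-set idea with `difflib.SequenceMatcher`, and the two dataframes are replaced by plain lists of tuples (`rows1`, `rows2`) so that it runs without pandas.

```python
from difflib import SequenceMatcher
from itertools import product


def token_set_score(a: str, b: str) -> int:
    """Simplified token-set similarity on a 0-100 scale (an approximation,
    not fuzzywuzzy's exact code)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    # Rebuild each name as "shared tokens + remaining tokens" in sorted order,
    # so the word order of the original strings no longer matters.
    rebuilt_a = (common + " " + " ".join(sorted(tokens_a - tokens_b))).strip()
    rebuilt_b = (common + " " + " ".join(sorted(tokens_b - tokens_a))).strip()
    candidates = [(common, rebuilt_a), (common, rebuilt_b), (rebuilt_a, rebuilt_b)]
    return max(round(100 * SequenceMatcher(None, x, y).ratio()) for x, y in candidates)


# Same rows as df1 and df2 in the question, as plain (name, value) tuples.
rows1 = [("Wilder Deontay", 1), ("Fury Tyson", 2), ("Ortiz Luis", 3),
         ("Joshua Olaseni Oluwafemi Anthony", 4)]
rows2 = [("Deontay Wilder", 19), ("Tyson Fury", 20), ("Luis Ortiz ", 21),
         ("Anthony Joshua", 10)]

# Cross join: compare every row of rows1 with every row of rows2,
# then keep only the pairs whose score is exactly 100.
matches = [
    (name1, val, val3)
    for (name, val3), (name1, val) in product(rows1, rows2)
    if token_set_score(name, name1) == 100
]

for row in matches:
    print(row)
```

In this sketch a name whose tokens are all contained in a longer name scores 100, which is why 'Anthony Joshua' pairs with 'Joshua Olaseni Oluwafemi Anthony'. The same property can produce false matches on other data (for example, a single shared surname), so a strict `== 100` filter may need reviewing before it is applied to a larger dataset.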

Or you can use fuzzymatcher

>>> from fuzzymatcher import fuzzy_left_join
>>> fuzzy_left_join(df1, df2, 'name', 'name1')[['name1', 'val', 'val3']]
            name1  val  val3
0  Deontay Wilder   19     1
1      Tyson Fury   20     2
2     Luis Ortiz    21     3
3  Anthony Joshua   10     4