Question

"Table 1"                 -  "Table 2"              -  "Intended Table"
"5 Smith Rd"              -  "5 Smith Rd"           -  "5 Smith Rd"
"7 John Rd"               -  "7 John Rd"            -  "7 John Rd"
"Ft 1, 7 James Rd"        -  "7 James Rd"           -  "Flat 1, 7 James Rd"
"Flat 1, 7 Smith street"  -  "FT1, 7 Smith Street"  -  "Flat 1, 7 Smith Street"

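This is only a sample table; the real tables follow a similar format but have 70k rows.

I am using pandas to merge the datasets, and I am merging on the "ADDRESS" column. Most addresses are formatted identically, so they match. However, as the sample shows, some addresses fail to match because of formatting differences. What I would like to know is: can the columns be merged when the values match by 80% or so? This is frustrating, because I need to merge the tables, and doing it by hand would take weeks.

When I merge, I get plenty of matches, but the addresses that do not follow the same format are missed, which ruins my final dataset.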

Answer 1

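As far as I know, pandas.merge() does not offer any option for partial matches. One approach is to normalise the tables first with regular-expression replacements and then merge. How well this works, however, depends on how predictable the different formats of the entries are.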
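To see which rows an exact merge leaves unmatched, one option is the `indicator` parameter of merge. A minimal sketch on two rows of the sample data (the names `left`, `right` and `unmatched` are illustrative only):

```python
import pandas as pd

left = pd.DataFrame({"ADDRESS": ["5 Smith Rd", "Ft 1, 7 James Rd"]})
right = pd.DataFrame({"ADDRESS": ["5 Smith Rd", "7 James Rd"]})

# indicator=True adds a "_merge" column whose values are
# "both", "left_only" or "right_only".
merged = left.merge(right, on="ADDRESS", how="outer", indicator=True)

# Rows that did not find an exact partner in the other table.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)
```

Only the rows listed in `unmatched` would then need any normalisation or manual review.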

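See below for some code that works on the sample dataset, although the 70k-row dataset may well contain edge cases. It assumes that any address containing a flat follows the pattern "f[some_characters][a digit][comma]", and rewrites it as "flat [the digit][comma]".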
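The pattern can be tried in isolation with plain `re` before applying it to a dataframe. This is a small sketch; `FLAT_PATTERN` and `normalise_flat` are hypothetical names introduced here for illustration:

```python
import re

# A non-capturing prefix that starts with "f", followed by a captured
# "<digit>," group. Only the captured group survives the replacement.
FLAT_PATTERN = re.compile(r"(?:^f.*)(\d,)")

def normalise_flat(address: str) -> str:
    """Lowercase the address and rewrite a leading flat prefix as 'flat <digit>,'."""
    return FLAT_PATTERN.sub(r"flat \1", address.lower())

for raw in ["Ft 1, 7 James Rd", "FT1, 7 Smith Street", "5 Smith Rd"]:
    print(raw, "->", normalise_flat(raw))
```

One edge case to be aware of: because `.*` is greedy and only a single digit is captured, a multi-digit flat number such as "Flat 12, 7 Smith Rd" would come out as "flat 2, 7 smith rd".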

import re
import pandas as pd

table_1 = pd.DataFrame({'ADDRESS': ["5 Smith Rd", "7 John Rd", "Ft 1, 7 James Rd", "Flat 1, 7 Smith street"]})
table_2 = pd.DataFrame({'ADDRESS': ["5 Smith Rd", "7 John Rd", "7 James Rd", "FT1, 7 Smith Street"]})

# make all lowercase to avoid case formatting issue
table_1['ADDRESS'] = table_1['ADDRESS'].str.lower()
table_2['ADDRESS'] = table_2['ADDRESS'].str.lower()

# Find the pattern that starts with f, has any number of characters, and then has a digit
# followed by a comma. If this pattern exists, replace it with "flat <digit>,".
table_1['ADDRESS'] = table_1['ADDRESS'].replace({r'(?:^f.*)(\d,)': r'flat \1'}, regex=True)
table_2['ADDRESS'] = table_2['ADDRESS'].replace({r'(?:^f.*)(\d,)': r'flat \1'}, regex=True)

# Some addresses still include the flat while their counterpart has only the road name.
# Walk both dataframes in parallel (this assumes row i of table_1 corresponds to row i
# of table_2, and that both use the default integer index). Where only one side has
# the flat, copy that address over to the side that doesn't.
regex = re.compile(r"flat \d,")
for idx, (t1_address, t2_address) in enumerate(zip(table_1['ADDRESS'],
                                                   table_2['ADDRESS'])):
    has_flat_1 = bool(regex.match(t1_address))
    has_flat_2 = bool(regex.match(t2_address))
    if has_flat_1 and not has_flat_2:
        # .loc avoids the chained-assignment pitfall of table_2['ADDRESS'][idx] = ...
        table_2.loc[idx, 'ADDRESS'] = t1_address
    elif has_flat_2 and not has_flat_1:
        table_1.loc[idx, 'ADDRESS'] = t2_address

new_table = table_1.merge(table_2, how='outer')
print(new_table)
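Where the formats are too irregular for a regular expression, an approximate match in the spirit of the 80% threshold from the question might be worth trying. The sketch below uses the standard library's difflib rather than anything built into pandas; `closest_address`, the sample list and the 0.8 cutoff are assumptions made for illustration.

```python
import difflib

def closest_address(address, candidates, cutoff=0.8):
    """Return the candidate most similar to `address`, or None if none reaches the cutoff."""
    matches = difflib.get_close_matches(address, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical, already lowercased addresses from the second table.
table_2_addresses = ["5 smith rd", "7 john rd", "7 james rd", "flat 1, 7 smith street"]

print(closest_address("flat 1, 7 smith st", table_2_addresses))
print(closest_address("99 unknown avenue", table_2_addresses))
```

Each lookup compares the address against every candidate, so on 70k rows it is probably only practical for the rows that an exact merge left unmatched, and the chosen pairs would still need a spot check.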

Some useful SO questions about regular expressions (for example, using the ?: operator to replace only part of a match):

How to replace only part of the match with python re.sub

python re.sub, only replace part of match

Merge columns that only partially match in Python?
