Merging on columns that only partially match in Python?

Time: 2020-09-22 09:48:34

Tags: python pandas join merge

"Table 1"                 -  "Table 2"              -  "Intended Table"
"5 Smith Rd"              -  "5 Smith Rd"           -  "5 Smith Rd"
"7 John Rd"               -  "7 John Rd"            -  "7 John Rd"
"Ft 1, 7 James Rd"        -  "7 James Rd"           -  "Flat 1, 7 James Rd"
"Flat 1, 7 Smith street"  -  "FT1, 7 Smith Street"  -  "Flat 1, 7 Smith Street"

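This is just a sample table; the real tables follow a similar format but have 70k rows.

I'm using pandas to merge the datasets, and I'm merging on the "address" column. Most addresses share the same format and therefore match. However, as the sample shows, some addresses don't match because they are formatted differently. What I'd like to know is: can I merge on the column if the values match by 80% or so? This is frustrating, because I need to merge the tables and doing it by hand would take weeks.

When I merge, I get plenty of matches, but the rows that don't follow the same format are missed, which ruins my final dataset.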

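1 answer:

Answer 0: (score: 0)

As far as I know, pandas.merge() does not provide any option for matching on part of a key. One approach is to normalize the tables first with a regular-expression replacement and then merge. How well this works, though, depends on how predictable the different formats of the entries are.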
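To make the normalize-then-merge idea concrete, here is a minimal sketch using only `re`. The helper name `normalize_address` and the pattern `FLAT_PREFIX` are illustrative assumptions, not part of pandas, and the pattern is a stricter variant written for the sample addresses only; real data will likely need more rules.

```python
import re

# Hypothetical pattern: a leading "f" word (ft, flat, ...), optional spaces,
# then the flat number and a comma. This is an assumption about the data.
FLAT_PREFIX = re.compile(r"^f[a-z]*\s*(\d+),")

def normalize_address(address: str) -> str:
    """Lowercase the address and rewrite a leading flat abbreviation as 'flat N,'."""
    return FLAT_PREFIX.sub(r"flat \1,", address.strip().lower())

samples = ["Ft 1, 7 James Rd", "FT1, 7 Smith Street",
           "Flat 1, 7 Smith street", "5 Smith Rd"]
normalized = [normalize_address(s) for s in samples]
print(normalized)
```

After normalization, the two spellings of the Smith Street flat become identical strings, so an ordinary exact-key merge can pair them. Addresses with no flat prefix pass through with only the lowercasing applied.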

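See below for some code that works on the sample dataset, although the 70k-row dataset may contain edge cases. It assumes that any address with a flat number follows the pattern "f[some_characters][a digit][comma]", and replaces it with "flat [the digit][comma]".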

import re
import pandas as pd

table_1 = pd.DataFrame({'ADDRESS':  ["5 Smith Rd" , "7 John Rd", "Ft 1, 7 James Rd", "Flat 1, 7 Smith street"]})
table_2 = pd.DataFrame({ 'ADDRESS': ["5 Smith Rd", "7 John Rd", "7 James Rd", "FT1, 7 Smith Street"]})

# make all lowercase to avoid case formatting issue
table_1['ADDRESS'] = table_1['ADDRESS'].str.lower()
table_2['ADDRESS'] = table_2['ADDRESS'].str.lower()

# Find a string that starts with "f", followed by any characters, and then a digit and a comma.
# If this pattern exists, replace everything up to and including it with "flat <digit>,".
table_1['ADDRESS'] = table_1['ADDRESS'].replace({r'(?:^f.*)(\d,)': r'flat \1'}, regex=True)
table_2['ADDRESS'] = table_2['ADDRESS'].replace({r'(?:^f.*)(\d,)': r'flat \1'}, regex=True)

# Some addresses now include the flat while others only have the road name.
# Iterate through both dataframes row by row (this assumes the rows are aligned by position)
# and, if one entry has the flat while the other doesn't, replace the one that doesn't
# with the one that does.
regex = re.compile(r"flat \d,")
for idx, (t1_address, t2_address) in enumerate(zip(table_1['ADDRESS'],
                                                   table_2['ADDRESS'])):
    if regex.match(t1_address) and not regex.match(t2_address):
        # .loc avoids chained assignment, which may not update the dataframe
        table_2.loc[idx, 'ADDRESS'] = t1_address
    elif regex.match(t2_address) and not regex.match(t1_address):
        table_1.loc[idx, 'ADDRESS'] = t2_address

new_table = table_1.merge(table_2, how='outer')
print(new_table)
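For the "match at roughly 80%" idea in the question, one possible alternative (not what the code above does) is to look up the closest address with the standard library's `difflib.get_close_matches` and merge on that. This is only a sketch: the column name `MATCH_KEY` and the helper `closest_address` are made up for illustration, the 0.8 cutoff is taken from the question rather than tuned, and comparing every address against every candidate may be slow at 70k rows.

```python
import difflib
import pandas as pd

table_1 = pd.DataFrame({"ADDRESS": ["5 Smith Rd", "7 John Rd",
                                    "Ft 1, 7 James Rd", "Flat 1, 7 Smith street"]})
table_2 = pd.DataFrame({"ADDRESS": ["5 Smith Rd", "7 John Rd",
                                    "7 James Rd", "FT1, 7 Smith Street"]})

# Lowercased table_2 addresses are the pool of candidates to match against.
candidates = table_2["ADDRESS"].str.lower().tolist()

def closest_address(address, cutoff=0.8):
    """Return the most similar candidate, or None if nothing reaches the cutoff."""
    matches = difflib.get_close_matches(address.lower(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# MATCH_KEY is a hypothetical helper column used only as the merge key.
table_1["MATCH_KEY"] = table_1["ADDRESS"].map(closest_address)
table_2["MATCH_KEY"] = table_2["ADDRESS"].str.lower()

fuzzy_merged = table_1.merge(table_2, on="MATCH_KEY", how="left",
                             suffixes=("_1", "_2"))
print(fuzzy_merged)
```

Rows whose best candidate falls below the cutoff get no partner and can be reviewed separately. In the sample, "Ft 1, 7 James Rd" against "7 James Rd" scores under 0.8, so a fixed threshold may still miss some pairs unless it is lowered or combined with the regex normalization above.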

Some useful SO questions about regular expressions (for example, using the ?: operator to replace only part of a match):

How to replace only part of the match with python re.sub

python re.sub, only replace part of match
