
Time: 2020-09-22 09:48:34

Tags: python pandas join merge

"Table 1"                 -  "Table 2"              -  "Intended Table"
"5 Smith Rd"              -  "5 Smith Rd"           -  "5 Smith Rd"
"7 John Rd"               -  "7 John Rd"            -  "7 John Rd"
"Ft 1, 7 James Rd"        -  "7 James Rd"           -  "Flat 1, 7 James Rd"
"Flat 1, 7 Smith street"  -  "FT1, 7 Smith Street"  -  "Flat 1, 7 Smith Street"




1 answer:

Answer 0 (score: 0)


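See below for some code that works on the sample dataset, though the 70k dataset may contain edge cases. It assumes that any address with a flat follows the pattern "f[some characters][a digit][comma]" and replaces that prefix with "flat [the digit][comma]".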
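As a rough illustration of that assumption, the sketch below applies the same pattern with plain `re.sub` to a few sample strings. The helper name `normalise_flat` and the stricter `STRICT_PATTERN` are hypothetical and not part of the answer. The multi-digit flat number is one example of the kind of edge case that might show up in the larger dataset.

```python
import re

# Pattern used in the answer: starts with "f", any characters, then a digit and a comma.
ANSWER_PATTERN = re.compile(r"(?:^f.*)(\d,)")

# Hypothetical stricter variant (an assumption, not from the answer):
# "f", optional letters, optional whitespace, one or more digits, then a comma.
STRICT_PATTERN = re.compile(r"^f[a-z]*\s*(\d+),")


def normalise_flat(address, pattern=ANSWER_PATTERN, repl=r"flat \1"):
    """Lowercase the address and rewrite a leading flat prefix using `pattern`."""
    return pattern.sub(repl, address.lower())


# The sample addresses behave as the answer describes.
for raw in ["Ft 1, 7 James Rd", "FT1, 7 Smith Street", "5 Smith Rd"]:
    print(normalise_flat(raw))

# A two-digit flat number: the greedy ".*" swallows the first digit,
# so the answer's pattern keeps only the last digit before the comma.
print(normalise_flat("Flat 12, 3 High St"))
print(normalise_flat("Flat 12, 3 High St", STRICT_PATTERN, r"flat \1,"))
```

Whether the stricter variant is appropriate depends on how flats are actually written in the real data, so it is worth checking a sample of addresses before relying on either pattern.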

import re
import pandas as pd

table_1 = pd.DataFrame({'ADDRESS':  ["5 Smith Rd" , "7 John Rd", "Ft 1, 7 James Rd", "Flat 1, 7 Smith street"]})
table_2 = pd.DataFrame({ 'ADDRESS': ["5 Smith Rd", "7 John Rd", "7 James Rd", "FT1, 7 Smith Street"]})

# make all lowercase to avoid case formatting issue
table_1['ADDRESS'] = table_1['ADDRESS'].str.lower()
table_2['ADDRESS'] = table_2['ADDRESS'].str.lower()

# Find the pattern at the start of the string that begins with "f", has any number of characters,
# and then has "d," where d is any digit. If this pattern exists, replace it with "flat d,".
# Raw strings keep "\d" from being treated as a string escape.
table_1['ADDRESS'] = table_1['ADDRESS'].replace({r'(?:^f.*)(\d,)': r'flat \1'}, regex=True)
table_2['ADDRESS'] = table_2['ADDRESS'].replace({r'(?:^f.*)(\d,)': r'flat \1'}, regex=True)

# Now there is an issue that some addresses have the flat while others just have the road name.
# Iterate through both dataframes row by row and, if one entry has the flat while the other
# doesn't, replace the one that doesn't with the one that does.
# Note: .loc[idx, ...] assumes both tables keep the default 0..n-1 index and are aligned row by row.
regex = re.compile(r"flat \d,")
for idx, (t1_address, t2_address) in enumerate(zip(table_1['ADDRESS'], table_2['ADDRESS'])):
    if regex.match(t1_address) and not regex.match(t2_address):
        table_2.loc[idx, 'ADDRESS'] = t1_address
    elif regex.match(t2_address) and not regex.match(t1_address):
        table_1.loc[idx, 'ADDRESS'] = t2_address

new_table = table_1.merge(table_2, how='outer')
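
Since the larger dataset may contain edge cases, one way to surface them is to ask `merge` for its `indicator` column and look at the rows that did not match on both sides. This is only a sketch: the small tables below are invented sample data standing in for the normalised addresses, and the names `checked` and `unmatched` are hypothetical.

```python
import pandas as pd

# Hypothetical normalised addresses; the last row of each table is invented
# to show what an address that failed to match looks like.
table_1 = pd.DataFrame({"ADDRESS": ["5 smith rd", "flat 1, 7 james rd", "9 king st"]})
table_2 = pd.DataFrame({"ADDRESS": ["5 smith rd", "flat 1, 7 james rd", "9 king street"]})

# indicator=True adds a "_merge" column with "both", "left_only" or "right_only".
checked = table_1.merge(table_2, how="outer", on="ADDRESS", indicator=True)

# Rows present in only one table are candidates for further cleaning.
unmatched = checked[checked["_merge"] != "both"]
print(unmatched)
```

Rows tagged `left_only` or `right_only` are the addresses the normalisation did not reconcile, which gives a concrete list to inspect.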


