"Table 1" - "Table 2" - "Intended Table"
"5 Smith Rd" - "5 Smith Rd" - "5 Smith Rd"
"7 John Rd" - "7 John Rd" - "7 John Rd"
"Ft 1, 7 James Rd" - "7 James Rd" - "Flat 1, 7 James Rd"
"Flat 1, 7 Smith street" - "FT1, 7 Smith Street" - "Flat 1, 7 Smith Street"
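This is only a sample table; the real tables follow a similar format but have 70k rows.
I am using pandas to merge the datasets, and I am merging on the "Address" column. Most addresses share the same format and therefore match. However, as the sample shows, some addresses fail to match because of formatting differences. What I would like to know is: can I merge on a column when the values match to 80% or so? This is frustrating, because I need to merge the tables, but doing it by hand would take weeks.
When I merge, I get plenty of matches, but the rows that do not follow the same format are missing, which ruins my final dataset.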
Answer 0 (score: 0)
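As far as I know, pandas.merge() does not offer any option for partial (approximate) matching. One approach is to normalize the tables with a regexp replacement first and then merge. How well this works depends on how predictable the different formats of the entries are.
See below for some code that works on the sample dataset, though the 70k dataset may contain edge cases. It assumes that any address containing a flat follows the pattern "f[some_characters][a digit][comma]", and replaces that with "flat [the digit][comma]".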
import re
import pandas as pd
table_1 = pd.DataFrame({'ADDRESS': ["5 Smith Rd" , "7 John Rd", "Ft 1, 7 James Rd", "Flat 1, 7 Smith street"]})
table_2 = pd.DataFrame({ 'ADDRESS': ["5 Smith Rd", "7 John Rd", "7 James Rd", "FT1, 7 Smith Street"]})
# make all lowercase to avoid case formatting issue
table_1['ADDRESS'] = table_1['ADDRESS'].str.lower()
table_2['ADDRESS'] = table_2['ADDRESS'].str.lower()
# Find a prefix that starts with "f", is followed by any characters, and ends with a digit and a comma.
# If it exists, replace the prefix with "flat " and keep the captured digit and comma.
table_1['ADDRESS'] = table_1['ADDRESS'].replace({r'(?:^f.*)(\d,)': r'flat \1'}, regex=True)
table_2['ADDRESS'] = table_2['ADDRESS'].replace({r'(?:^f.*)(\d,)': r'flat \1'}, regex=True)
# Now there is an issue that some addresses have the flat while others just have the road name.
# Iterate through both dataframes and, if one entry has the flat while the other doesn't, replace
# the one that doesn't with the one that does.
# Note: this assumes both tables are row-aligned and use the default integer index.
regex = re.compile(r"flat \d,")
for idx, (t1_address, t2_address) in enumerate(zip(table_1['ADDRESS'],
                                                   table_2['ADDRESS'])):
    if regex.match(t1_address) and not regex.match(t2_address):
        # .loc avoids pandas' chained-assignment pitfall
        table_2.loc[idx, 'ADDRESS'] = t1_address
    elif regex.match(t2_address) and not regex.match(t1_address):
        table_1.loc[idx, 'ADDRESS'] = t2_address
new_table = table_1.merge(table_2, how='outer')
print(new_table)
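If the formats are not predictable enough for a regexp, the "80% match" idea from the question can be approximated with the standard library's difflib. The following is only a rough sketch, not something pandas provides: the helper closest_address, the KEY column and the 0.8 cutoff are assumptions made for illustration. It maps each table_2 address to its most similar table_1 address and then merges on that key.

```python
import difflib

import pandas as pd

table_1 = pd.DataFrame({"ADDRESS": ["5 Smith Rd", "7 John Rd",
                                    "Ft 1, 7 James Rd", "Flat 1, 7 Smith street"]})
table_2 = pd.DataFrame({"ADDRESS": ["5 Smith Rd", "7 John Rd",
                                    "7 James Rd", "FT1, 7 Smith Street"]})


def closest_address(address, candidates, cutoff=0.8):
    """Hypothetical helper: return the candidate most similar to `address`,
    or None if no candidate reaches the similarity cutoff."""
    matches = difflib.get_close_matches(address.lower(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Lowercased join keys, so case differences do not count against similarity.
table_1["KEY"] = table_1["ADDRESS"].str.lower()
table_2["KEY"] = table_2["ADDRESS"].str.lower()

# Replace each table_2 key with its closest table_1 key; keep it unchanged if nothing is close enough.
candidates = table_1["KEY"].tolist()
table_2["KEY"] = table_2["KEY"].map(lambda a: closest_address(a, candidates) or a)

merged = table_1.merge(table_2, on="KEY", how="outer", suffixes=("_1", "_2"))
print(merged)
```

With this cutoff "FT1, 7 Smith Street" is paired with "Flat 1, 7 Smith street", but "7 James Rd" is not paired with "Ft 1, 7 James Rd": their similarity is below 0.8, and it is unclear from the string alone whether they are the same address. Since get_close_matches compares against every candidate, on 70k rows it is probably worth doing an exact merge first and fuzzy-matching only the rows that are left over.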
Some useful SO questions on regular expressions (e.g. using the ?: non-capturing group to replace only part of a match):