Python function to find similarity between strings in different formats

Date: 2018-05-19 11:49:33

Tags: python string python-3.x formatting substring

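I have two Excel files containing item names. I want to compare the items, but the only remotely similar column is the name column, and even that uses different name formats, such as

KIDS-Piano as kids piano

Butter Gel 100mg as Butter-Gel-100MG

I know it can't be 100% accurate, so I will ask whoever runs the code to do a final verification, but how can I display the closest matching names?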

1 Answer:

Answer 0: (score: 1)

The proper way to do this is to write a regular expression.
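A minimal sketch of what that regular-expression approach might look like. The helper name `normalize` is hypothetical, and treating every run of non-alphanumeric ASCII characters as a separator is an assumption that fits the two sample names in the question but may be too aggressive for other data (for example, names with non-ASCII letters).

```python
import re

def normalize(name):
    # Lowercase the name, then collapse every run of characters that is
    # not a-z or 0-9 (dashes, spaces, underscores, ...) into one space.
    # Assumption: item names are ASCII; non-ASCII letters would be dropped.
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

print(normalize("KIDS-Piano"))
print(normalize("Butter-Gel-100MG"))
```

With this normalization, both spellings of each sample name reduce to the same string, so they can be compared with plain equality.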

But the following vanilla code will also do the trick:

column_a = ["KIDS-Piano", "Butter Gel 100mg"]
column_b = ["kids piano", "Butter-Gel-100MG"]

def normalize(name):
    # convert the string to lowercase and replace dashes with spaces
    return name.lower().replace('-', ' ')

new_column_a = [normalize(i) for i in column_a]
# do the same for column b
new_column_b = [normalize(i) for i in column_b]

as_not_found_in_b = []
for i in new_column_a:
    if i not in new_column_b:
        as_not_found_in_b.append(i)

bs_not_found_in_a = []
for i in new_column_b:
    if i not in new_column_a:
        bs_not_found_in_a.append(i)

# find the problematic ones and manually fix them
print(as_not_found_in_b)
print(bs_not_found_in_a)
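
The code above only reports names that have no exact counterpart after normalization. To actually show the closest candidates for manual review, as the question asks, one option is the standard library's `difflib.get_close_matches`. The following is a sketch, not a tuned solution: the helper names `normalize` and `closest_matches`, the extra sample item "Kids Guitar", and the `n=3` / `cutoff=0.6` values are all assumptions chosen for illustration.

```python
import difflib

def normalize(name):
    # Same normalization as above: lowercase, dashes become spaces.
    return name.lower().replace("-", " ")

def closest_matches(names_a, names_b, n=3, cutoff=0.6):
    # Map each normalized name in B back to its original spelling.
    # Assumption: if two names in B normalize identically, the later one wins.
    originals_b = {normalize(b): b for b in names_b}
    result = {}
    for a in names_a:
        # get_close_matches returns up to n candidates scoring at least
        # cutoff, ordered from most to least similar.
        hits = difflib.get_close_matches(
            normalize(a), list(originals_b), n=n, cutoff=cutoff
        )
        result[a] = [originals_b[h] for h in hits]
    return result

column_a = ["KIDS-Piano", "Butter Gel 100mg"]
column_b = ["kids piano", "Butter-Gel-100MG", "Kids Guitar"]  # last item is made up

for name, candidates in closest_matches(column_a, column_b).items():
    print(name, "->", candidates)
```

Each name from the first file is printed with its best candidates from the second file, best first, so the person doing the final verification only has to confirm or reject a short list. Raising `cutoff` gives fewer but stricter suggestions.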