Question

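What I'd like to do is compare a string against a number of other strings to see which one is the best match.

Currently I'm using re.search to get the matched string, then I split the original string on it and take the half I want: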

company = re.search("Supplier Address:?|Supplier Identification:?|Supplier Name:?|Supplier:?|Company Information:?|Company's Name:?|Manufacturer's Name|Manufacturer:?|MANUFACTURER:?|Manufacturer Name:?", arg)

But this doesn't work very well, especially since I have several strings like this one:

"SECTION 1 - MANUFACTURER'S INFORMATION Manufacturer Name HAYWARD LABORATORIES Emergency"

What I want is:

HAYWARD LABORATORIES

With this string, the way I'm doing it now, the pattern matches MANUFACTURER first, so what I currently get is:

'S INFORMATION Manufacturer Name HAYWARD LABORATORIES

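How can I fix this? Is there a better approach? Thanks.

Edit: here are more of the strings I'm working with, each followed by the result I want: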

"Identification of the company Lutex Company Limited 20/F., "

Lutex Company Limited

"Product and Company Information Product Name: Lip Balm Base Product Code: A462-BALM Client Code: 900 Company: Ni Hau Industrial Co., Ltd. Company Address:"

Ni Hau Industrial Co., Ltd.

Answer 1

If all of your sections follow the same pattern, Name FACTORY NAME (the word Name followed by two uppercase words), then you can try this:

import re
s = "SECTION 1 - MANUFACTURER'S INFORMATION Manufacturer Name HAYWARD LABORATORIES Emergency"
# raw string so that \s reaches the regex engine unchanged
final_data = re.findall(r"(?<=Name\s)[A-Z]+\s[A-Z]+", s)
print(final_data)

Output:

['HAYWARD LABORATORIES']
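
The lookbehind above only covers two uppercase words after Name, so it won't handle the other sample strings in the question. One possible sketch for those is to capture everything between a known label and the next "stop" marker. The LABELS and STOPS lists and the extract_company helper here are hypothetical, chosen only to fit the three samples; real documents would need their own lists.

```python
import re

# Hypothetical label and stop-marker lists (assumptions, fitted to the samples only).
# "Company:" requires the colon so that "Company Information" is not taken as a label.
LABELS = ["Manufacturer Name:?", "Identification of the company:?", "Company:"]
STOPS = ["Emergency", "Company Address", r"\d"]

# label, whitespace, then a lazy capture that ends right before a stop marker
# (or at the end of the string if no stop marker follows).
PATTERN = re.compile(
    r"(?:%s)\s+(.+?)(?=\s+(?:%s)|\s*$)" % ("|".join(LABELS), "|".join(STOPS))
)

def extract_company(text):
    """Return the text between the first label and the next stop marker, or None."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

samples = [
    "SECTION 1 - MANUFACTURER'S INFORMATION Manufacturer Name HAYWARD LABORATORIES Emergency",
    "Identification of the company Lutex Company Limited 20/F., ",
    "Product and Company Information Product Name: Lip Balm Base Product Code: A462-BALM "
    "Client Code: 900 Company: Ni Hau Industrial Co., Ltd. Company Address:",
]
for sample in samples:
    print(extract_company(sample))
```

The matching is case-sensitive on purpose: that is what keeps MANUFACTURER'S INFORMATION from being picked up before Manufacturer Name.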

Answer 2

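You can use the fuzzywuzzy module to do a form of fuzzy matching. Basically, you compute a similarity score between two strings: fuzz.ratio returns an integer from 0 to 100, and the higher the score, the closer the match.

For example, suppose you have a list of strings and you're searching for the closest match, like this: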

from fuzzywuzzy import fuzz

string_to_be_matched = 'string_sth'
list_of_strings = ['string_1', 'string_2', 'string_n']  # ... and any others

# store the index plus the similarity score (0-100) for each string in list_of_strings
result = [(i, fuzz.ratio(string_to_be_matched, x)) for i, x in enumerate(list_of_strings)]

# the best match is the entry with the highest score
best_index, best_score = max(result, key=lambda pair: pair[1])

For more information on the fuzzywuzzy module, see link
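
If installing fuzzywuzzy isn't an option, the standard library's difflib offers a comparable similarity ratio. This is only a minimal sketch of the same idea, not a drop-in replacement: best_match and the labels list are hypothetical names, the comparison is lowercased as an assumption about what you want, and scores run from 0.0 to 1.0 rather than 0 to 100.

```python
from difflib import SequenceMatcher

def best_match(query, candidates):
    """Return (candidate, score) for the candidate most similar to query."""
    scored = [
        (candidate, SequenceMatcher(None, query.lower(), candidate.lower()).ratio())
        for candidate in candidates
    ]
    # the highest ratio is the closest match
    return max(scored, key=lambda pair: pair[1])

# Hypothetical list of known labels to compare against.
labels = ["Manufacturer Name", "Supplier Name", "Company Information"]
label, score = best_match("MANUFACTURER NAME:", labels)
print(label, round(score, 2))
```

You would still want a minimum score threshold so that strings matching none of the labels are rejected instead of being assigned to the least-bad one.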

Compare multiple strings to find the best match
