Question

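I have a list of company names, and I have a list of URLs whose pages mention company names.

The end goal is to look at each URL and find out how many of the companies mentioned there are in my list.

Example URL: http://www.dmx.com/about/our-clients

Each page is structured differently, so I don't have a good way to run a regex search and extract each company name as a separate string.

I'd like to build a for loop that searches the entire contents of the page for each company in my list. But it seems that Levenshtein distance works better for comparing two short strings than for comparing a short string with a large body of text.

Where should a beginner be looking?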

Answer 1

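It doesn't seem to me that you need any "fuzzy" matching. I'm also assuming that by "URL" you mean the web page at the address the URL points to. Just use Python's built-in substring search: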

>>> from urllib.request import urlopen
>>> with urlopen('http://www.dmx.com/about/our-clients') as webpage:
...     # read() returns bytes; decode so that str names can be searched for
...     webpage_text = webpage.read().decode('utf-8', errors='replace')
... 
>>> for name in ['Caribou Coffee', 'Express', 'Sears']:
...     if name in webpage_text:
...         print(name + " found!")
... 
Caribou Coffee found!
Express found!
>>>

If you're worried about mismatched capitalization, just convert everything to uppercase.

>>> webpage_text = webpage_text.upper()
>>> for name in ['CARIBOU COFFEE', 'EXPRESS', 'SEARS']:
...     if name in webpage_text:
...         print(name + ' found!')
... 
CARIBOU COFFEE found!
EXPRESS found!
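
For the stated goal of counting how many listed companies appear on a page, the same check can be wrapped in a small helper. This is only a minimal sketch, assuming the page text is already available as a string; `find_companies` and the sample text are hypothetical names used for illustration. Unlike a plain substring test, it refuses matches inside longer words, so "Express" would not be reported for a page that only contains "Expression".

```python
import re

def find_companies(page_text, company_names):
    """Return the names from company_names that occur in page_text.

    Matching is case-insensitive and must not be glued to other
    letters or digits on either side.
    """
    found = []
    for name in company_names:
        # re.escape keeps characters such as '.' or '&' in a name literal.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, page_text, flags=re.IGNORECASE):
            found.append(name)
    return found

# Hypothetical sample text standing in for a downloaded page.
sample_text = "Our clients include CARIBOU COFFEE, Express and many more."
matches = find_companies(sample_text, ["Caribou Coffee", "Express", "Sears"])
print(len(matches), "of 3 companies found:", matches)
```

Whether the stricter whole-word behaviour is wanted depends on the data; for names that are often written run together with other text, the plain `in` test may be the better fit.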

Answer 2

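I would add to the previous answer that it may make sense to normalize your names in some way (for example, by removing all special characters) and then apply the same normalization to both webpage_text and your list of strings.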

def normalize_str(some_str):
    some_str = some_str.lower()
    for c in """-?'"/{}[]()&!,.`""":
        some_str = some_str.replace(c,"")
    return some_str
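
A short sketch of how the normalization might be applied to both sides before the substring test. The function is repeated so the snippet runs on its own; the sample page text and the names in `client_names` are made up for illustration.

```python
def normalize_str(some_str):
    # Same idea as the function above: lowercase, then drop punctuation.
    some_str = some_str.lower()
    for c in """-?'"/{}[]()&!,.`""":
        some_str = some_str.replace(c, "")
    return some_str

# Hypothetical sample data standing in for the downloaded page and the list.
webpage_text = "Clients: Macy's, AT&T, and Caribou Coffee."
client_names = ["Macys", "AT&T", "Sears"]

# Normalize the page once, then normalize each name before searching.
normalized_text = normalize_str(webpage_text)
found = [name for name in client_names if normalize_str(name) in normalized_text]
print(found)
```

Because both sides go through the same function, "Macys" in the list still matches "Macy's" on the page.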

If that's not good enough, you can turn to difflib and do something like this:

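import difflib

# get_close_matches() needs a list of candidate strings, so the page is split
# into lines here (an assumption; any other way of chunking the text works too).
candidates = normalize_str(webpage_text).splitlines()
for client_name in normalized_client_names:
    closest_client = difflib.get_close_matches(client_name, candidates, 1, 0.8)
    if closest_client:
        print("%s found as %s" % (client_name, closest_client[0]))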

The cutoff I chose, a Ratcliff/Obershelp ratio of 0.8, is arbitrary and may be too lenient or too strict; play with it a bit.
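
Since difflib compares a name against individual candidate strings, the large page text has to be cut into pieces of comparable size first. One possible way, sketched below, is to slide a window with as many words as the company name over the text and score each window with `difflib.SequenceMatcher`. The helper `best_fuzzy_match`, the sample text, and the 0.8 threshold are assumptions for illustration, and the loop is not tuned for very large pages.

```python
import difflib

def best_fuzzy_match(name, text):
    """Return (ratio, window) for the word window in text most similar to name."""
    name = name.lower()
    words = text.lower().split()
    # Compare against windows with the same number of words as the name.
    size = max(1, len(name.split()))
    best_ratio, best_window = 0.0, ""
    for i in range(max(1, len(words) - size + 1)):
        window = " ".join(words[i:i + size])
        ratio = difflib.SequenceMatcher(None, name, window).ratio()
        if ratio > best_ratio:
            best_ratio, best_window = ratio, window
    return best_ratio, best_window

# Hypothetical page text containing a misspelled company name.
sample_text = "We are proud to serve Cariboo Coffee and Sears Holdings"
for company in ["Caribou Coffee", "Express"]:
    ratio, window = best_fuzzy_match(company, sample_text)
    if ratio >= 0.8:
        print(company, "found as", repr(window))
    else:
        print(company, "not found")
```

This keeps both strings handed to the matcher short, which is the situation where edit-distance style scores are meaningful.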

Fuzzy-matching a string in a large body of text in Python (URL)