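I have been trying to parse company names out of the domain and the page title of HTML pages. Suppose my domain is:
http://thisismycompany.com
and the page title is:
This is an example page title | My Company
My assumption is that if I lowercase both strings, strip everything except alphanumeric characters, and then take their longest common substring, the result is very likely the company name.
So the longest common substring (link to Python 3 code) will return mycompany.
How can I match this substring back to the original page title, so that I can retrieve the correct positions of the spaces and the uppercase characters?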
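For context, the normalization and longest-common-substring step described in the question could be sketched as follows. This is only an illustration: the helper names `normalize` and `longest_common_substring` are hypothetical, and the standard library's `difflib.SequenceMatcher` stands in for the linked Python 3 code, which is not reproduced here.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    # Lowercase and drop everything except ASCII letters and digits
    return re.sub(r'[^0-9a-z]+', '', text.lower())


def longest_common_substring(a, b):
    # autojunk=False disables difflib's "popular element" heuristic
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


domain = normalize('http://thisismycompany.com')
title = normalize('This is an example page title | My Company')
print(longest_common_substring(domain, title))  # mycompany
```

Note that the scheme and top-level domain are left in the normalized domain here. In this example they do not affect the result, but stripping them first may be safer in general.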
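Answer 0 (score: 1)
I considered whether a regular expression would work, but I think plain string operations and comparisons are easier here, especially since this does not look like a time-sensitive task.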
def find_name(normalized_name, full_name_container):
    if not normalized_name:
        return ''
    n = 0
    full_name = ''
    for ch in full_name_container:
        # If the characters at the current position in both
        # strings match, add the proper case to the final string
        # and move on to the next character
        if normalized_name[n].upper() == ch.upper():
            full_name += ch
            n += 1
            # The whole normalized name has been matched
            if n == len(normalized_name):
                return full_name
        # If the name is interrupted by a separator, add that to the result
        # (only once a match has started, to avoid leading separators)
        elif n > 0 and ch in ['-', '_', '.', ' ']:
            full_name += ch
        # If a character is encountered that is definitely not part of the name,
        # restart the search
        else:
            n = 0
            full_name = ''
    # No complete match was found
    return ''

print(find_name('mycompany', 'Some stuff My Company Some Stuff'))
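This should print "My Company". The hard-coded list of characters that may interrupt the normalized name (such as spaces and commas) is probably something you will have to refine.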
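One way to avoid hard-coding the separators is to remember, for every character kept during normalization, where it came from in the original title. The sketch below is an assumption-laden illustration, not part of the answer above: the function name `find_original_span` is hypothetical, and "alphanumeric" is taken to mean ASCII letters and digits, as in the question.

```python
def find_original_span(normalized_name, title):
    """Return (start, end) indices into `title`, or None if there is no match."""
    if not normalized_name:
        return None
    # positions[k] is the index in `title` of the k-th kept character
    positions = [i for i, ch in enumerate(title)
                 if ch.isascii() and ch.isalnum()]
    normalized_title = ''.join(title[i] for i in positions).lower()
    start = normalized_title.find(normalized_name.lower())
    if start == -1:
        return None
    first = positions[start]
    last = positions[start + len(normalized_name) - 1]
    return first, last + 1


title = 'This is an example page title | My Company'
span = find_original_span('mycompany', title)
print(title[span[0]:span[1]])  # My Company
```

Because the span is expressed in the original title's indices, whatever lies between the matched characters (spaces, hyphens, punctuation) is preserved without having to list it in advance.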
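Answer 1 (score: 1)
I solved it by generating a list of all possible substrings of the title, and then comparing each one, after normalization, against the match I got from the longest common substring function.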
import re

def get_all_substrings(input_string):
    length = len(input_string)
    return {input_string[i:j + 1] for i in range(length) for j in range(i, length)}

longest_substring_match = 'mycompany'
page_title = 'This is an example page title | My Company'
match = None
# Check shorter substrings first so that the match carries no extra
# leading or trailing separators (set iteration order is arbitrary)
for substring in sorted(get_all_substrings(page_title), key=len):
    if re.sub('[^0-9a-zA-Z]+', '', substring).lower() == longest_substring_match.lower():
        match = substring
        break
print(match)
Edit: source used
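Enumerating every substring grows quadratically with the title length. As a hedged alternative, the same character class can be used to build a single search pattern from the normalized name, allowing any run of non-alphanumeric characters between consecutive letters. The function name `find_with_regex` is hypothetical and this is only a sketch of the idea.

```python
import re


def find_with_regex(normalized_name, title):
    if not normalized_name:
        return None
    # Allow any run of non-alphanumeric characters between consecutive letters
    pattern = r'[^0-9a-zA-Z]*'.join(re.escape(ch) for ch in normalized_name)
    m = re.search(pattern, title, flags=re.IGNORECASE)
    return m.group(0) if m else None


print(find_with_regex('mycompany', 'This is an example page title | My Company'))
# My Company
```

If the positions are needed rather than the text, `m.span()` on the match object gives the start and end indices in the original title.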