Python - Parsing a company name from a domain and page title

Date: 2017-01-30 10:08:59

Tags: python parsing

I have been trying to parse a company name out of a domain name and the page title found in the HTML. Say my domain is:

http://thisismycompany.com

and the page title is:

This is an example page title | My Company

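My assumption is that if I take the longest common substring of the two, after lowercasing both and removing everything except alphanumeric characters, the result is very likely the company name.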
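As a rough illustration of that assumption, here is a minimal sketch of the normalize-then-compare step. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for whatever longest-common-substring routine is actually in use; the helper names `normalize` and `longest_common_substring` are made up for this example.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    # Lowercase, then drop everything except ASCII letters and digits
    return re.sub(r'[^0-9a-z]+', '', text.lower())


def longest_common_substring(a, b):
    # autojunk=False turns off difflib's "popular element" heuristic,
    # which is not wanted for a plain substring comparison
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]


domain = normalize('http://thisismycompany.com')
title = normalize('This is an example page title | My Company')
print(longest_common_substring(domain, title))  # mycompany
```

Note that the whole URL is normalized here, scheme and TLD included, so a title that happens to contain "http" or "com" could in principle influence the result.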

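So the longest common substring (Link to python 3 code) returns mycompany. How can I map this substring back to the original page title, so that I can recover the correct positions of the whitespace and the uppercase characters?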

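2 answers:

Answer 0 (score: 1)

I considered whether a regular expression would work, but I think plain string operations and comparisons are easier, especially since this does not look like a time-sensitive task.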

def find_name(normalized_name, full_name_container):
  n = 0
  full_name = ''
  for char in full_name_container:
    # Stop as soon as the whole normalized name has been matched
    if n == len(normalized_name):
      return full_name

    # If the characters at the current position in both
    # strings match, add the proper case to the final string
    # and move onto the next character
    if normalized_name[n].upper() == char.upper():
      full_name += char
      n += 1

    # If the name is interrupted by a separator, add that to the result,
    # but only once a match has started (avoids leading separators)
    elif char in ['-', '_', '.', ' ']:
      if n > 0:
        full_name += char

    # If a character is encountered that is definitely not part of the name
    # Re-start the search (note: this greedy scan does not backtrack)
    else:
      n = 0
      full_name = ''

  # Only return a result if the whole name was matched
  return full_name if n == len(normalized_name) else ''

print(find_name('mycompany', 'Some stuff My Company Some Stuff'))

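This should print "My Company". The hard-coded list of characters that may interrupt the normalized name (such as spaces and commas) is probably something you will have to improve.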
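One possible way around the hard-coded separator list, offered only as a sketch, is to build a regular expression from the normalized name that tolerates any run of non-alphanumeric characters between consecutive letters. The function name `find_name_regex` is hypothetical, and the approach assumes the normalized name contains only ASCII letters and digits.

```python
import re


def find_name_regex(normalized_name, full_name_container):
    if not normalized_name:
        return None
    # Between every pair of characters, allow zero or more characters
    # that are not ASCII letters or digits (spaces, dashes, commas, ...)
    pattern = '[^0-9a-zA-Z]*'.join(re.escape(ch) for ch in normalized_name)
    found = re.search(pattern, full_name_container, re.IGNORECASE)
    return found.group(0) if found else None


print(find_name_regex('mycompany', 'Some stuff My-Company, Inc.'))  # My-Company
```

Because the regex engine backtracks on its own, this variant also copes with false starts such as the "m" in "Some", and it returns None when the name is not present at all.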

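Answer 1 (score: 1)

I solved it by generating a list of all possible substrings of the title, and then comparing each one against the match I got from the longest common substring function.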

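import re

def get_all_substrings(input_string):
    length = len(input_string)
    return set([input_string[i:j+1] for i in range(length) for j in range(i,length)])

longest_substring_match = 'mycompany'
page_title = 'This is an example page title | My Company'

# Several substrings normalize to the same string (e.g. ' My Company' and
# '| My Company'), and set iteration order is arbitrary, so keep the
# shortest candidate rather than the first one encountered
matches = [substring for substring in get_all_substrings(page_title)
           if re.sub('[^0-9a-zA-Z]+', '', substring).lower() == longest_substring_match.lower()]
match = min(matches, key=len) if matches else None

print(match)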

Edit: source used
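Enumerating every substring grows quickly with the length of the title. A lighter alternative, sketched here under the assumption that normalization only lowercases and drops non-ASCII-alphanumeric characters, is to remember the original index of each character that survives normalization and slice the title between the first and last matched index. The name `locate_in_original` is made up for this example.

```python
import string

ALNUM = set(string.ascii_letters + string.digits)


def locate_in_original(normalized_match, original):
    if not normalized_match:
        return None
    # Keep (lowercased character, index in original) for kept characters only
    kept = [(ch.lower(), i) for i, ch in enumerate(original) if ch in ALNUM]
    normalized = ''.join(ch for ch, _ in kept)
    start = normalized.find(normalized_match.lower())
    if start == -1:
        return None
    first = kept[start][1]
    last = kept[start + len(normalized_match) - 1][1]
    # Slice the original so spacing and capitalization are preserved
    return original[first:last + 1]


page_title = 'This is an example page title | My Company'
print(locate_in_original('mycompany', page_title))  # My Company
```

Since the slice starts and ends on matched characters, the result never carries leading or trailing separators.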