Question

I scrape various job pages for a given keyword and extract the titles and links when they match.

XPATH_MAPPING_SINGLE_PAGE = {'heading' : "//*[self::h2 or self::h3 or self::h4 or self::dt][contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '%s')]"}
XPATH_MAPPING_HYPERLINKS = {'href': "//a[contains(translate(normalize-space(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '%s')]/@href",
                        'text': "//a[contains(translate(normalize-space(), 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '%s')]"}

import urllib2
import urlparse

import lxml.html as lh


def get_individual_job_titles_and_hyperlinks(root, keyword):
    # Collect the visible text of every <a> whose lowercased text contains the keyword.
    texts = [element.text_content().strip() for element in root.xpath(XPATH_MAPPING_HYPERLINKS['text'] % keyword)]
    # Collect the href attribute of the same set of <a> elements.
    hrefs = root.xpath(XPATH_MAPPING_HYPERLINKS['href'] % keyword)
    return zip(texts, hrefs)


# url and keyword are defined elsewhere in the scraper.
response = urllib2.urlopen(url)
content = response.read()
root = lh.fromstring(content)
titles_and_links = get_individual_job_titles_and_hyperlinks(root, keyword)

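This works quite reliably. However, for a page like https://www.gosquared.com/careers/ with the keyword 'Engineer', it extracts the individual engineering jobs on the page, but it also extracts the links to the company's engineering blog: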

>>> print titles_and_links
[('Engineering Blog', '//engineering.gosquared.com/'), ('Software Engineer', '/careers/software-engineer/'), ('Engineering Blog', '//engineering.gosquared.com/')]

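This is clearly happening because my XPath is based on contains(). As soon as it finds the text 'Engineer' it considers it a match, which explains why the 'Engineering' links are picked up as well.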
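A minimal sketch that reproduces the false positive on a small document. The markup in SAMPLE_HTML is hypothetical (it is not the real GoSquared page), and the keyword is lowercased before substitution, which is an assumption about how the scraper calls the XPath, since the expression lowercases the link text.

```python
import lxml.html as lh

# Hypothetical markup, loosely modelled on the page described above.
SAMPLE_HTML = """
<html><body>
  <a href="/careers/software-engineer/">Software Engineer</a>
  <a href="//engineering.gosquared.com/">Engineering Blog</a>
</body></html>
"""

# Same expression as XPATH_MAPPING_HYPERLINKS['text'] in the question.
LINK_TEXT_XPATH = (
    "//a[contains(translate(normalize-space(), "
    "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '%s')]"
)

root = lh.fromstring(SAMPLE_HTML)

# Assumption: the keyword is lowercased, because the XPath lowercases the text.
keyword = "Engineer".lower()

# contains() is a plain substring test, so "engineering blog" matches too.
matches = [a.text_content().strip() for a in root.xpath(LINK_TEXT_XPATH % keyword)]
```

Here `matches` holds both link texts, because "engineer" is a substring of "engineering".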

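How can I modify the XPath so that it doesn't produce these false positives? The updated XPath needs to know to stop right after the keyword ends and possibly expect some punctuation (space, hyphen, period, comma, etc.) rather than a letter, so that it still correctly picks up link texts such as: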

Engineer-Mechanical
Chemist - Pharmaceutical
Consultant, Healthcare Dept.
etc.

Can this be done purely in XPath, without adding a regular expression to account for punctuation or whitespace?

Answer 1

I'm assuming we can't rely on any specific part of the page where job titles might appear.

However, I'm fairly sure you can avoid looking inside header and footer elements. Check the ancestors:

//*
  [self::h2 or self::h3 or self::h4 or self::dt]
  [contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '%s')]
  [not(ancestor::footer) and not(ancestor::header)]

This helps avoid matching Engineering Blog in this particular case, because it is located in the footer.
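As a rough sketch, the same ancestor predicate could also be appended to the hyperlink XPath from the question. The answer only shows it on the heading expression, so applying it to links is an assumption, as are the sample markup and the helper name `links_outside_header_footer`.

```python
import lxml.html as lh

# Hypothetical markup: the blog link sits in the header and footer,
# the job link sits in the page body.
SAMPLE_HTML = """
<html><body>
  <header><a href="//engineering.gosquared.com/">Engineering Blog</a></header>
  <div><a href="/careers/software-engineer/">Software Engineer</a></div>
  <footer><a href="//engineering.gosquared.com/">Engineering Blog</a></footer>
</body></html>
"""

# Link XPath from the question, plus the ancestor filter from this answer.
FILTERED_LINK_XPATH = (
    "//a[contains(translate(normalize-space(), "
    "'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), '%s')]"
    "[not(ancestor::footer) and not(ancestor::header)]"
)


def links_outside_header_footer(root, keyword):
    """Return (text, href) pairs for matching links outside <header>/<footer>."""
    anchors = root.xpath(FILTERED_LINK_XPATH % keyword.lower())
    return [(a.text_content().strip(), a.get("href")) for a in anchors]


root = lh.fromstring(SAMPLE_HTML)
result = links_outside_header_footer(root, "Engineer")
```

Note that this filters by location only: an "Engineering Blog" link placed in the main body of a page would still match, since contains() remains a substring test.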

Tell an XPath query based on contains() to stop as soon as it hits a letter?
