def parse_linkpage(self, response):
    hxs = HtmlXPathSelector(response)
    item = QualificationItem()
    xpath = """
        //h2[normalize-space(.)="Entry requirements for undergraduate courses"]
        /following-sibling::p
    """
    item['Qualification'] = hxs.select(xpath).extract()[1:]
    item['Country'] = response.meta['a_of_the_link']
    return item
So I'd like to know whether I can make my code stop scraping once it reaches the next <h2>.
This is the web page:
<h2>Entry requirements for undergraduate courses</h2>
<p>Example1</p>
<p>Example2</p>
<h2>Postgraduate Courses</h2>
<p>Example3</p>
<p>Example4</p>
I want these results:
Example1
Example2
But I get:
Example1
Example2
Example3
Example4
I know I can change this line,
item['Qualification'] = hxs.select(xpath).extract()
to,
item['Qualification'] = hxs.select(xpath).extract()[0:2]
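But this scraper visits many different pages, and some of them may have more than 2 paragraphs under the first heading, so a fixed slice would leave that information out.
Is there a way to tell it to extract exactly the data that follows the heading I want, rather than everything after it?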
Answer 0: (score: 2)
It's not very pretty or easy to read, but you can use the EXSLT extensions to XPath and the set:difference() operation:
>>> selector.xpath("""
...     set:difference(//h2[normalize-space(.)="Entry requirements for undergraduate courses"]
...                         /following-sibling::p,
...                    //h2[normalize-space(.)="Entry requirements for undergraduate courses"]
...                         /following-sibling::h2[1]
...                         /following-sibling::p)""").extract()
[u'<p>Example1</p>', u'<p>Example2</p>']
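The idea is to select all the p elements that follow the target h2, and then exclude the p elements that follow the next h2.
An easier-to-read version: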
>>> for h2 in selector.xpath('//h2[normalize-space(.)="Entry requirements for undergraduate courses"]'):
...     paragraphs = h2.xpath("""set:difference(./following-sibling::p,
...                                             ./following-sibling::h2[1]/following-sibling::p)""").extract()
...     print paragraphs
...
[u'<p>Example1</p>', u'<p>Example2</p>']
>>>
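For trying this outside a Scrapy shell, here is a minimal self-contained sketch of the same set:difference() idea using lxml directly. The helper name paragraphs_under and the HTML constant are made up for illustration, and the sketch assumes that plain lxml needs the EXSLT sets namespace passed in explicitly, whereas the selector in the session above resolves the set prefix on its own.

```python
import lxml.html

# Sample page from the question, wrapped in a full document.
HTML = """
<html><body>
<h2>Entry requirements for undergraduate courses</h2>
<p>Example1</p>
<p>Example2</p>
<h2>Postgraduate Courses</h2>
<p>Example3</p>
<p>Example4</p>
</body></html>
"""

# Assumption: with plain lxml the "set" prefix has to be mapped by hand.
EXSLT_SETS = {"set": "http://exslt.org/sets"}
HEADING = "Entry requirements for undergraduate courses"


def paragraphs_under(root, heading):
    """Return the text of each <p> between the given <h2> and the next <h2>."""
    texts = []
    # $title is an XPath variable, filled in from the keyword argument.
    for h2 in root.xpath("//h2[normalize-space(.)=$title]", title=heading):
        nodes = h2.xpath(
            "set:difference(./following-sibling::p,"
            " ./following-sibling::h2[1]/following-sibling::p)",
            namespaces=EXSLT_SETS,
        )
        texts.extend(p.text_content() for p in nodes)
    return texts


root = lxml.html.fromstring(HTML)
texts = paragraphs_under(root, HEADING)
print(texts)
```

If the target heading is the last h2 on the page, the second node-set is empty, so every following p is kept.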
Answer 1: (score: 0)
Maybe you can use this XPath:
//h2[normalize-space(.)="Entry requirements for undergraduate courses"]
/following-sibling::p[not(preceding-sibling::h2[normalize-space(.)!="Entry requirements for undergraduate courses"])]
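That is, you add a second predicate to following-sibling::p that excludes any p with a preceding-sibling h2 whose text is not "Entry requirements for undergraduate courses".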
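One caveat with that predicate: it rejects a p if any earlier h2 sibling has different text, so it seems to return nothing when another heading comes before the target one. A possible variant, sketched below with lxml, checks only the nearest preceding h2 via preceding-sibling::h2[1]. The "Overview" heading and "Intro" paragraph are invented test data, not part of the original page, and the constant names are illustrative.

```python
import lxml.html

# Hypothetical page: same as the question, plus an extra section on top.
HTML = """
<html><body>
<h2>Overview</h2>
<p>Intro</p>
<h2>Entry requirements for undergraduate courses</h2>
<p>Example1</p>
<p>Example2</p>
<h2>Postgraduate Courses</h2>
<p>Example3</p>
<p>Example4</p>
</body></html>
"""

HEADING = "Entry requirements for undergraduate courses"

# Predicate from the answer: no preceding h2 sibling may differ from the target.
ANY_PRECEDING = (
    "//h2[normalize-space(.)=$title]"
    "/following-sibling::p[not(preceding-sibling::h2[normalize-space(.)!=$title])]"
)

# Variant: only the closest preceding h2 sibling has to be the target heading.
# On a reverse axis such as preceding-sibling, [1] picks the nearest node.
NEAREST_PRECEDING = (
    "//h2[normalize-space(.)=$title]"
    "/following-sibling::p[preceding-sibling::h2[1][normalize-space(.)=$title]]"
)

root = lxml.html.fromstring(HTML)
strict = [p.text_content() for p in root.xpath(ANY_PRECEDING, title=HEADING)]
nearest = [p.text_content() for p in root.xpath(NEAREST_PRECEDING, title=HEADING)]
print(strict)
print(nearest)
```

This stays within plain XPath 1.0, so it should not depend on the EXSLT extensions.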