Scrapy XPath: getting the text of an element that starts with <

Time: 2016-05-02 02:13:55

Tags: python xpath scrapy

I am trying to get the text "< 1 hour" from this HTML snippet.

<div class="details_wrapper">
<div class="detail">
    <b>Recommended length of visit:</b>
    <1 hour
</div>
<div class="detail">
    <b>Fee:</b>
    No
</div>
</div>

This is the XPath expression I am using:

visit_length = response.xpath(
    "//div[@class='details_wrapper']/"
    "div[@class='detail']/b[contains(text(), "
    "'Recommended length of visit:')]/parent::div/text()"
).extract()

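But it fails to get the text. I think this is because of the "<" in the text I need: it is being treated as the start of an HTML tag. How can I scrape the text "< 1 hour"?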

2 Answers:

Answer 0: (score: 2)

Since Scrapy uses lxml, it is worth checking how lxml handles this kind of HTML, where a text node contains the XML special character <:

>>> from lxml import html
>>> raw = '''<div class="details_wrapper">
... <div class="detail">
...     <b>Recommended length of visit:</b>
...     <1 hour
... </div>
... <div class="detail">
...     <b>Fee:</b>
...     No
... </div>
... </div>'''
... 
>>> root = html.fromstring(raw)
>>> print html.tostring(root)
<div class="details_wrapper">
<div class="detail">
    <b>Recommended length of visit:</b>

<div class="detail">
    <b>Fee:</b>
    No
</div>
</div></div>

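Note that in the demo above, just as you suspected, the text node '<1 hour' is gone entirely from the serialized source of the root element. To work around this, consider using BeautifulSoup, which handles this kind of HTML more sensibly (you can create the soup from the Scrapy response via response.body_as_unicode()):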

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(raw, "html.parser")
>>> print soup.prettify()
<div class="details_wrapper">
 <div class="detail">
  <b>
   Recommended length of visit:
  </b>
  &lt;1 hour
 </div>
 <div class="detail">
  <b>
   Fee:
  </b>
  No
 </div>
</div>

With BS, the target text node can be found as follows:

>>> soup.find('b', text='Recommended length of visit:').next_sibling
u'\n    <1 hour\n'
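
The soup above is built on the "html.parser" backend, i.e. Python's standard-library HTML parser, which treats a "<" that is not followed by a letter as ordinary character data. As a rough illustration of that behaviour, here is a minimal Python 3 sketch that uses only the standard library. The names DetailParser and parse_details are hypothetical, made up for this example; they are not part of Scrapy, lxml, or BeautifulSoup, and the sketch assumes the markup looks like the snippet in the question.

```python
from html.parser import HTMLParser

RAW = """<div class="details_wrapper">
<div class="detail">
    <b>Recommended length of visit:</b>
    <1 hour
</div>
<div class="detail">
    <b>Fee:</b>
    No
</div>
</div>"""


class DetailParser(HTMLParser):
    """Hypothetical helper: collect {label: value} from <div class="detail"> blocks."""

    def __init__(self):
        super().__init__()
        self.details = {}
        self._in_detail = False  # inside a <div class="detail">
        self._in_label = False   # inside the <b> label of that div
        self._label = []
        self._value = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "detail") in attrs:
            self._in_detail = True
            self._label, self._value = [], []
        elif tag == "b" and self._in_detail:
            self._in_label = True

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_label = False
        elif tag == "div" and self._in_detail:
            # Text may arrive in several chunks (the stray "<" is one of them),
            # so join the chunks before stripping whitespace.
            label = "".join(self._label).strip()
            self.details[label] = "".join(self._value).strip()
            self._in_detail = False

    def handle_data(self, data):
        if self._in_label:
            self._label.append(data)
        elif self._in_detail:
            self._value.append(data)


def parse_details(html_text):
    """Parse html_text and return the label -> value mapping."""
    parser = DetailParser()
    parser.feed(html_text)
    parser.close()
    return parser.details


details = parse_details(RAW)
print(details["Recommended length of visit:"])
```

In a spider, the same function could presumably be fed the decoded response body instead of RAW. For anything beyond this simple layout, the BeautifulSoup approach above is likely the more robust choice.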

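Answer 1: (score: 1)

This is an lxml issue, and it has been reported against Parsel, the parser Scrapy uses (see the linked issue).

As that issue says, the solution is to pass the type='xml' argument to the Selector. Your spider should look something like this: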

from scrapy import Selector
...
...
    def your_parse_method(self, response):
        sel = Selector(text=response.body_as_unicode(), type='xml')
        # now use "sel" instead of response for getting xpath info
        ...
        visit_length = sel.xpath("//div[@class='details_wrapper']/"
            "div[@class='detail']/b[contains(text(), "
            "'Recommended length of visit:')]/parent::div/text()").extract()