Question

I'm trying to get the text "<1 hour" from this HTML snippet.

<div class="details_wrapper">
<div class="detail">
    <b>Recommended length of visit:</b>
    <1 hour
</div>
<div class="detail">
    <b>Fee:</b>
    No
</div>
</div>

This is the XPath expression I'm using:

visit_length = response.xpath(
    "//div[@class='details_wrapper']/"
    "div[@class='detail']/b[contains(text(), "
    "'Recommended length of visit:')]/parent::div/text()"
).extract()

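But it fails to get the text. I think this is because of the "<" in the text I need, which is being treated as the start of an HTML tag. How can I scrape the text "<1 hour"?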

Answer 1

Since Scrapy uses lxml, it is worth checking how lxml handles this kind of HTML, which contains the XML special character < inside a text node:

>>> from lxml import html
>>> raw = '''<div class="details_wrapper">
... <div class="detail">
...     <b>Recommended length of visit:</b>
...     <1 hour
... </div>
... <div class="detail">
...     <b>Fee:</b>
...     No
... </div>
... </div>'''
>>> root = html.fromstring(raw)
>>> print html.tostring(root)
<div class="details_wrapper">
<div class="detail">
    <b>Recommended length of visit:</b>

<div class="detail">
    <b>Fee:</b>
    No
</div>
</div></div>

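Note that in the demo above, just as you suspected, the text node '<1 hour' has disappeared completely from the serialized root element. As a workaround, consider using BeautifulSoup, since it handles this kind of HTML more sensibly (you can pass response.body_as_unicode() to create the soup from a Scrapy response):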

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(raw, "html.parser")
>>> print soup.prettify()
<div class="details_wrapper">
 <div class="detail">
  <b>
   Recommended length of visit:
  </b>
  &lt;1 hour
 </div>
 <div class="detail">
  <b>
   Fee:
  </b>
  No
 </div>
</div>

Finding the target text node with BS can be done as follows:

>>> soup.find('b', text='Recommended length of visit:').next_sibling
u'\n    <1 hour\n'
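
If adding BeautifulSoup is not an option, roughly the same lookup can be sketched with Python's standard-library html.parser, which is the parser selected by the "html.parser" argument above. This is only a minimal sketch: the class name LabelValueParser and the helper extract_value are made up for illustration, and it assumes the value is the plain text that directly follows the <b> label.

```python
from html.parser import HTMLParser

RAW = """<div class="details_wrapper">
<div class="detail">
    <b>Recommended length of visit:</b>
    <1 hour
</div>
<div class="detail">
    <b>Fee:</b>
    No
</div>
</div>"""


class LabelValueParser(HTMLParser):
    """Hypothetical helper: collect the text that follows a <b> label."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.value = None        # filled in once the label's value is found
        self._in_b = False       # currently inside a <b> element
        self._b_text = []        # text chunks seen inside the current <b>
        self._capturing = False  # currently collecting the value text
        self._chunks = []        # text chunks collected after the label

    def _stop_capture(self):
        # Any tag boundary ends the value: join the chunks, trim whitespace.
        if self._capturing:
            self.value = "".join(self._chunks).strip()
            self._capturing = False

    def handle_starttag(self, tag, attrs):
        self._stop_capture()
        if tag == "b":
            self._in_b = True
            self._b_text = []

    def handle_endtag(self, tag):
        self._stop_capture()
        if tag == "b" and self._in_b:
            self._in_b = False
            label_text = "".join(self._b_text).strip()
            if self.value is None and label_text == self.label:
                self._capturing = True
                self._chunks = []

    def handle_data(self, data):
        # A bare "<" that does not start a tag arrives here as ordinary text.
        if self._in_b:
            self._b_text.append(data)
        elif self._capturing:
            self._chunks.append(data)


def extract_value(html_text, label):
    parser = LabelValueParser(label)
    parser.feed(html_text)
    parser.close()
    parser._stop_capture()  # covers input that ends without a closing tag
    return parser.value


print(extract_value(RAW, "Recommended length of visit:"))
```

In a spider, the same function could be called with the decoded response body in place of RAW.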

Answer 2

This is an lxml issue that has already been reported against Parsel, the parsing library Scrapy uses; see the issue filed there.

As stated there, the solution is to pass the type='xml' argument to the selector, so your spider should look something like this:

from scrapy import Selector
...
...
    def your_parse_method(self, response):
        sel = Selector(text=response.body_as_unicode(), type='xml')
        # now use "sel" instead of response for getting xpath info
        ...
        visit_length = sel.xpath("//div[@class='details_wrapper']/"
            "div[@class='detail']/b[contains(text(), "
            "'Recommended length of visit:')]/parent::div/text()").extract()

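A different workaround, not taken from the linked issue and offered only as a sketch, is to escape any "<" that cannot start a tag before building the selector, so that the default HTML parser sees "&lt;1 hour" as ordinary text. The helper name escape_bare_lt is hypothetical, and the regex assumes that a "<" followed by anything other than a letter, "/", "!" or "?" is meant literally. It would also rewrite such characters inside <script> or <style> blocks, which may not be what you want.

```python
import re

# "<" only opens markup when followed by a letter, "/", "!" or "?".
# Anything else (a digit, a space, end of input) is treated as literal text.
_BARE_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_bare_lt(html_text):
    """Replace every literal '<' with '&lt;' and leave real tags untouched."""
    return _BARE_LT.sub("&lt;", html_text)


sample = "<div><b>Recommended length of visit:</b> <1 hour</div>"
print(escape_bare_lt(sample))
```

The escaped string could then be passed as Selector(text=...) and queried with the original XPath expression.
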
Scrapy XPath: getting the text of an element that starts with "<".
