Question

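I have been using Scrapy for a few weeks, and I recently found that HtmlXPathSelector fails to parse some HTML files correctly.

The web page http://detail.zol.com.cn/series/268/10227_1.html contains exactly one tag of the form

`div id='param-more' class='mod_param  '`.

When I select that tag with the XPath "//div[@id='param-more']", the result is [].

I tried the same thing in scrapy shell and got the same result.

When I fetch the page with wget, I can still find the tag "div id='param-more' class='mod_param  '" in the HTML source, so I do not think the tag only appears after some action is triggered.

Please give me some hints on how to solve this problem.

Below is the code related to the problem. When it processes the URL above, len(nodes_product) is always 0.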

def parse_series(self, response):
    hxs = HtmlXPathSelector(response)

    xpath_product = "//div[@id='param-normal']/table//td[@class='name']/a | "\
                    "//div[@id='param-more']/table//td[@class='name']/a"
    nodes_product = hxs.select(xpath_product)
    if len(nodes_product) == 0:
        # there's only the title, no other products in the series
        .......
    else:
        .......

Answer 1

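This looks like a bug in Scrapy's XPath selectors. I wrote a quick test spider and ran into the same problem. I believe it has something to do with non-standard characters on the page.

I do not believe the problem is that the 'param-more' div is tied to a JavaScript event or hidden by CSS. I disabled JavaScript and also changed my user agent (and location) to see whether that affected the data on the page. It did not.

I was, however, able to parse the 'param-more' div using BeautifulSoup: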

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup

class TestSpider(BaseSpider):
    name = "Test"

    start_urls = [
        "http://detail.zol.com.cn/series/268/10227_1.html"
                 ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)

        #data = hxs.select("//div[@id='param-more']").extract()

        data = response.body
        soup = BeautifulSoup(data)
        print soup.find(id='param-more')

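Someone else may know more about the XPathSelector issue, but for the time being you can store the HTML that BeautifulSoup finds in an item and pass it into the pipeline.

Here is the link to the most recent BeautifulSoup version: http://www.crummy.com/software/BeautifulSoup/#Download

Update

I believe I found the specific issue. The page in question declares in a meta tag that it uses the GB 2312 charset. Converting from GB 2312 to Unicode is problematic because some characters have no Unicode equivalent. That alone would not be an issue, except that UnicodeDammit (BeautifulSoup's encoding detection module) actually determines the encoding to be ISO 8859-2. The problem is that lxml determines a document's encoding by looking at the charset specified in the meta tag of the header. As a result, there is a mismatch between the encoding that lxml perceives and the one that Scrapy perceives.

The following code demonstrates the issue and offers an alternative to relying on the BS4 library: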

from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup
import chardet

class TestSpider(BaseSpider):
    name = "Test"

    start_urls = [
        "http://detail.zol.com.cn/series/268/10227_1.html"
                 ]

    def parse(self, response):

        encoding = chardet.detect(response.body)['encoding']
        if encoding != 'utf-8':
            response.body = response.body.decode(encoding, 'replace').encode('utf-8')

        hxs = HtmlXPathSelector(response)
        data = hxs.select("//div[@id='param-more']").extract()
        #print encoding
        print data

Here you can see that by forcing lxml to use UTF-8, it no longer attempts to map from what it perceives as GB 2312 -> UTF-8.
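The re-encoding step can also be tried outside Scrapy. The following is a minimal Python 3 sketch, not the spider's exact code: `to_utf8` is a hypothetical helper, and the encoding name is supplied by the caller (it could come from `chardet.detect`, as in the spider above).

```python
def to_utf8(body: bytes, encoding: str) -> bytes:
    """Decode `body` with `encoding` and re-encode it as UTF-8.

    Bytes that are invalid in `encoding` become U+FFFD instead of
    raising UnicodeDecodeError.
    """
    text = body.decode(encoding, errors="replace")
    return text.encode("utf-8")


# "参数" encoded as GB 2312, followed by a byte that is not valid GB 2312.
raw = "参数".encode("gb2312") + b"\xff"
clean = to_utf8(raw, "gb2312")
print(clean.decode("utf-8"))
```

The trade-off is the same as in the spider: invalid bytes are replaced rather than recovered, so the parser sees a consistent UTF-8 document at the cost of losing the characters that could not be decoded.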

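In Scrapy, the encoding used by HtmlXPathSelector is set in the scrapy/selector/lxmlsel.py module. That module passes the response body to the lxml parser together with the response.encoding attribute, which is ultimately set in the scrapy/http/response/text.py module.

The code that sets the response.encoding attribute is as follows: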

@property
def encoding(self):
    return self._get_encoding(infer=True)

def _get_encoding(self, infer=False):
    enc = self._declared_encoding()
    if enc and not encoding_exists(enc):
        enc = None
    if not enc and infer:
        enc = self._body_inferred_encoding()
    if not enc:
        enc = self._DEFAULT_ENCODING
    return resolve_encoding(enc)

def _declared_encoding(self):
    return self._encoding or self._headers_encoding() \
        or self._body_declared_encoding()

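The important thing to note here is that both _headers_encoding and _encoding end up reflecting the encoding declared in the meta tag of the header, rather than using something like UnicodeDammit or chardet to determine the document's actual encoding. So there will be cases where a document contains characters that are invalid for the encoding it declares. I believe Scrapy overlooks this, which ultimately leads to the problem we are seeing here.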
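One way to picture the gap described above is to trust the declared encoding only when the body really decodes with it, and to fall back otherwise. This is a sketch under assumptions, not Scrapy's logic: `pick_encoding` and its fallback list are hypothetical and chosen only for illustration.

```python
def pick_encoding(body, declared, fallbacks=("utf-8", "gb18030", "latin-1")):
    """Return the first encoding that decodes `body` without errors.

    The declared encoding is tried first; the fallbacks are an
    illustrative guess, not a detection algorithm.
    """
    for name in (declared,) + tuple(fallbacks):
        if not name:
            continue
        try:
            body.decode(name)  # strict mode: raises on invalid bytes
        except (UnicodeDecodeError, LookupError):
            continue
        return name
    return None


# A page that declares UTF-8 but was actually saved as Latin-1.
body = "café".encode("latin-1")
print(pick_encoding(body, "utf-8"))
```

A strict trial decode only proves that the bytes are valid in an encoding, not that the encoding is the right one, so a statistical detector such as chardet may still be the better choice for real pages.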

Answer 2

'mod_param ' != 'mod_param'

The class is not equal to "mod_param", but it does contain "mod_param". Note the trailing spaces:

stav@maia:~$ scrapy shell http://detail.zol.com.cn/series/268/10227_1.html
2012-08-23 09:17:28-0500 [scrapy] INFO: Scrapy 0.15.1 started (bot: scrapybot)
Python 2.7.3 (default, Aug  1 2012, 05:14:39)
IPython 0.12.1 -- An enhanced Interactive Python.

In [1]: hxs.select("//div[@class='mod_param']")
Out[1]: []

In [2]: hxs.select("//div[contains(@class,'mod_param')]")
Out[2]: [<HtmlXPathSelector xpath="//div[contains(@class,'mod_param')]" data=u'<div id="param-more" class="mod_param  "'>]

In [3]: len(hxs.select("//div[contains(@class,'mod_param')]").extract())
Out[3]: 1

In [4]: len(hxs.select("//div[contains(@class,'mod_param')]").extract()[0])
Out[4]: 5372
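Note that contains(@class,'mod_param') would also match an unrelated class name such as 'mod_param_extra'. A stricter XPath 1.0 idiom compares whole whitespace-separated tokens. The sketch below mirrors that test in plain Python so it can be checked without a parser; `has_class` is a hypothetical helper, and the XPath string is shown only for reference.

```python
# XPath 1.0 idiom for "the class list contains the token mod_param":
TOKEN_XPATH = (
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' mod_param ')]"
)


def has_class(class_attr, name):
    """True if `name` is one of the whitespace-separated class tokens."""
    return name in class_attr.split()


print(has_class("mod_param  ", "mod_param"))      # token match ignores padding
print("mod_param  " == "mod_param")               # exact string comparison
print(has_class("mod_param_extra", "mod_param"))  # substring is not a token
```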

Scrapy cannot parse some HTML files correctly
