Scrapy Python: unicode link error

Time: 2013-07-25 15:15:27

Tags: python scrapy

link encoding

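When crawling a website, Scrapy extracts links that contain &amp; and throws an exception: "Do not instantiate Link objects with unicode urls. Assuming utf-8 encoding (which could be wrong)". How can I fix this error?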

1 answer:

Answer 0 (score: 0)

I had the same problem when this character appeared in some links. I found this related commit on GitHub, then used this advice to write a file link_extractors.py:

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.response import get_base_url


class CustomLinkExtractor(SgmlLinkExtractor):
    """Need this to fix the encoding error."""

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            hxs = HtmlXPathSelector(response)
            base_url = get_base_url(response)
            body = u''.join(f for x in self.restrict_xpaths
                           for f in hxs.select(x).extract())
            try:
                body = body.encode(response.encoding)
            except UnicodeEncodeError:
                body = body.encode('utf-8')
        else:
            body = response.body

        links = self._extract_links(body, response.url, response.encoding, base_url)
        links = self._process_links(links)
        return links
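
The encoding step inside extract_links can be tried on its own. This is only a minimal sketch, assuming Python 3 and no Scrapy installation; encode_body is a hypothetical helper name, and the HTML fragments are made up for illustration. It mirrors the try/except above: use the response encoding when it can represent the text, otherwise fall back to UTF-8.

```python
def encode_body(body, encoding):
    """Encode text with the given encoding, falling back to UTF-8.

    `encode_body` is a hypothetical helper, not part of Scrapy.
    """
    try:
        return body.encode(encoding)
    except UnicodeEncodeError:
        # The declared encoding cannot represent some character,
        # so encode as UTF-8 instead.
        return body.encode('utf-8')


# Plain ASCII text encodes fine with the declared encoding.
ascii_link = encode_body('<a href="/gp/offer-listing/1">next</a>', 'latin-1')

# The euro sign does not exist in latin-1, so the UTF-8 fallback is used.
euro_link = encode_body('<a href="/gp/offer-listing/1">\u20ac</a>', 'latin-1')
```

In both cases the result is a byte string, which is what the extractor passes on to _extract_links.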

After that I used it in my spiders.py:

rules = (
    Rule(CustomLinkExtractor(allow=('/gp/offer-listing*', ),
                             restrict_xpaths=("//li[contains(@class,'a-last')]/a", )),
         callback='parse_start_url', follow=True),
)
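
The allow argument is a regular expression that is searched for in each extracted URL. A minimal sketch of what '/gp/offer-listing*' accepts, using only the standard re module; the example URLs are hypothetical:

```python
import re

# Same pattern as in the Rule above. The trailing `*` applies only to
# the final "g" (zero or more "g"), not to "anything that follows".
pattern = re.compile('/gp/offer-listing*')

# Made-up URLs for illustration only.
urls = [
    'http://www.example.com/gp/offer-listing/B000000000',
    'http://www.example.com/gp/product/B000000000',
]

# search() matches anywhere in the URL, so no trailing wildcard is needed.
matched = [url for url in urls if pattern.search(url)]
```

Because the pattern is searched rather than anchored, '/gp/offer-listing' without the asterisk would very likely select the same pages here.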