Link encoding
When crawling a website, scrapy extracts links containing &amp; and throws an exception: "Do not instantiate Link objects with unicode urls. Assuming utf-8 encoding (which could be wrong)". How can I fix this error?
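For context, here is a minimal sketch (spider name, URL, and patterns are illustrative, not taken from the question) of the kind of CrawlSpider setup in which the stock link extractor raises this exception on older Scrapy versions that still ship scrapy.contrib:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class OfferSpider(CrawlSpider):  # hypothetical spider, for illustration only
    name = 'offers'
    start_urls = ['http://www.example.com/']  # placeholder start page

    rules = (
        # Pages whose extracted links contain raw '&amp;' entities or
        # non-ASCII characters can trigger the unicode-URL error inside
        # the stock SgmlLinkExtractor.
        Rule(SgmlLinkExtractor(allow=('/offer-listing', )),
             callback='parse_start_url', follow=True),
    )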
Answer 0 (score: 0)
I ran into the same problem with this character → inserted in some links. I found this related commit on GitHub, but rather than use it, I followed this advice and wrote the file link_extractors.py:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.response import get_base_url


class CustomLinkExtractor(SgmlLinkExtractor):
    """Need this to fix the encoding error."""

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            hxs = HtmlXPathSelector(response)
            base_url = get_base_url(response)
            # Join only the parts of the page selected by restrict_xpaths.
            body = u''.join(f for x in self.restrict_xpaths
                            for f in hxs.select(x).extract())
            try:
                body = body.encode(response.encoding)
            except UnicodeEncodeError:
                # Fall back to utf-8 when the declared response encoding
                # cannot represent every character in the body.
                body = body.encode('utf-8')
        else:
            body = response.body
        links = self._extract_links(body, response.url, response.encoding, base_url)
        links = self._process_links(links)
        return links
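Compared with the stock extract_links in that Scrapy version, the substantive change is the try/except around body.encode(response.encoding): when the XPath-restricted body contains characters that the declared response encoding cannot represent (such as the → above), the body is re-encoded as utf-8 instead of letting the UnicodeEncodeError abort link extraction.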
After that, I used it in my spiders.py:
# Imports assumed at the top of spiders.py (pre-1.0 scrapy.contrib paths;
# the second import assumes link_extractors.py is importable from the spider module):
from scrapy.contrib.spiders import Rule
from link_extractors import CustomLinkExtractor

rules = (
    Rule(CustomLinkExtractor(allow=('/gp/offer-listing*', ),
                             restrict_xpaths=("//li[contains(@class,'a-last')]/a", )),
         callback='parse_start_url', follow=True,
         ),
)