Link encoding
When crawling a website, scrapy extracts links containing &amp; and throws an exception: "Do not instantiate Link objects with unicode urls. Assuming utf-8 encoding (which could be wrong)". How can I fix this error?
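For context, here is a minimal sketch (spider name, URL, and patterns are illustrative, not taken from the question) of the kind of CrawlSpider setup in which the stock link extractor raises this exception on older Scrapy versions that still ship scrapy.contrib:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class OfferSpider(CrawlSpider):  # hypothetical spider, for illustration only
    name = 'offers'
    start_urls = ['http://www.example.com/']  # placeholder start page

    rules = (
        # Pages whose extracted links contain raw '&amp;' entities or
        # non-ASCII characters can trigger the unicode-URL error inside
        # the stock SgmlLinkExtractor.
        Rule(SgmlLinkExtractor(allow=('/offer-listing', )),
             callback='parse_start_url', follow=True),
    )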
Answer 0 (score: 0)
I ran into the same problem with this character → inserted in some links. I found this related commit on GitHub, but rather than use it, I followed this advice and wrote the file link_extractors.py:
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.utils.response import get_base_url


class CustomLinkExtractor(SgmlLinkExtractor):
    """Need this to fix the encoding error."""

    def extract_links(self, response):
        base_url = None
        if self.restrict_xpaths:
            hxs = HtmlXPathSelector(response)
            base_url = get_base_url(response)
            # Join only the parts of the page selected by restrict_xpaths.
            body = u''.join(f for x in self.restrict_xpaths
                            for f in hxs.select(x).extract())
            try:
                body = body.encode(response.encoding)
            except UnicodeEncodeError:
                # Fall back to utf-8 when the declared response encoding
                # cannot represent every character in the body.
                body = body.encode('utf-8')
        else:
            body = response.body
        links = self._extract_links(body, response.url, response.encoding, base_url)
        links = self._process_links(links)
        return links
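Compared with the stock extract_links in that Scrapy version, the substantive change is the try/except around body.encode(response.encoding): when the XPath-restricted body contains characters that the declared response encoding cannot represent (such as the → above), the body is re-encoded as utf-8 instead of letting the UnicodeEncodeError abort link extraction.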
After that, I used it in my spiders.py:
# Imports assumed at the top of spiders.py (pre-1.0 scrapy.contrib paths;
# the second import assumes link_extractors.py is importable from the spider module):
from scrapy.contrib.spiders import Rule
from link_extractors import CustomLinkExtractor

rules = (
    Rule(CustomLinkExtractor(allow=('/gp/offer-listing*', ),
                             restrict_xpaths=("//li[contains(@class,'a-last')]/a", )),
         callback='parse_start_url', follow=True,
         ),
)