How to use extract_links() to get URLs from a webpage encoded in 'gb2312'

Time: 2018-08-16 15:33:37

Tags: python scrapy codec

Environment: Python 2.7, OS: Ubuntu

I want to extract some links from a webpage and test it in the scrapy shell, but I ran into a UnicodeError.

My code:

le.extract_links(response.body.decode('gb2312'))

The error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 39: invalid continuation byte

In the source code of this webpage I found that it is encoded as 'gb2312', so I tried:

print response.body.decode('gb2312') — this prints all of the HTML.
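The codec mismatch can be reproduced without scrapy at all: gb2312-encoded bytes raise UnicodeDecodeError when decoded as utf-8, but decode cleanly with the page's declared codec. A minimal stdlib-only sketch (the sample text is a hypothetical stand-in for the page body, not taken from the actual page):

```python
# -*- coding: utf-8 -*-
# gb2312 bytes fail under the default utf-8 codec but decode
# correctly when the page's declared encoding is used.
raw = u'\u94fe\u63a5'.encode('gb2312')  # the text "链接" ("link") as gb2312 bytes

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print('utf-8 decode failed: %s' % exc)

print(raw.decode('gb2312'))  # round-trips back to the original text
```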

But when I run it, I get this traceback:

Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/scrapy/linkextractors/lxmlhtml.py", line 128, in extract_links
    links = self._extract_links(doc, response.url, response.encoding, base_url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/linkextractors/__init__.py", line 109, in _extract_links
    return self.link_extractor._extract_links(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/linkextractors/lxmlhtml.py", line 76, in _extract_links
    return self._deduplicate_if_needed(links)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/linkextractors/lxmlhtml.py", line 91, in _deduplicate_if_needed
    return unique_list(links, key=self.link_key)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/utils/python.py", line 78, in unique
    seenkey = key(item)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/linkextractors/lxmlhtml.py", line 43, in <lambda>
    keep_fragments=True)
  File "/usr/local/lib/python2.7/dist-packages/w3lib/url.py", line 433, in canonicalize_url
    parse_url(url), encoding=encoding)
  File "/usr/local/lib/python2.7/dist-packages/w3lib/url.py", line 510, in parse_url
    return urlparse(to_unicode(url, encoding))
  File "/usr/local/lib/python2.7/dist-packages/w3lib/util.py", line 27, in to_unicode
    return text.decode(encoding, errors)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 39: invalid continuation byte

And another error:

AttributeError: 'unicode' object has no attribute 'text'

This is because extract_links expects an HtmlResponse as its argument, but response.body and response.text return a bytes object and a unicode object respectively.

type(response)

Result: <class 'scrapy.http.response.html.HtmlResponse'>

So I don't know how to fix up the response and extract links from it. Is there any way to specify that the returned response is 'utf-8' instead of 'gb2312'?
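The traceback above bottoms out in w3lib's to_unicode, which decodes the extracted URL bytes with the default 'utf-8'. That failure can be mimicked with the standard library alone (the query string below is a hypothetical example, not one from the actual page):

```python
# -*- coding: utf-8 -*-
# A URL containing raw gb2312 bytes cannot be decoded as utf-8,
# which is effectively what the canonicalization/deduplication
# step in the traceback attempts.
url_bytes = b'/search?q=' + u'\u65b0\u95fb'.encode('gb2312')  # "新闻" ("news")

try:
    url_bytes.decode('utf-8')  # mirrors to_unicode(url, 'utf-8')
except UnicodeDecodeError as exc:
    print('decode failed, as in the traceback: %s' % exc)

print(url_bytes.decode('gb2312'))  # succeeds with the page's encoding
```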


2 answers:

Answer 0 (score: 1)

I think you should be able to specify the encoding manually like this: response.replace(encoding='gb2312'), and then try passing that to the link extractor.

Edit: it seems the URL encoding is hard to specify somewhere in the link-processing chain (when deduplication is performed — I believe in w3lib.url.canonicalize_url). As a workaround, you can use the following:

resp = response.replace(encoding='utf8', body=response.text.encode('utf8'))
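What this workaround does, at the byte level, is decode the gb2312 body and re-encode it as utf-8, so that every later step sees utf-8 bytes. A stdlib-only sketch of that transformation (the HTML snippet is hypothetical):

```python
# -*- coding: utf-8 -*-
# Equivalent byte-level transformation performed by the workaround:
# gb2312 body -> unicode -> utf-8 body.
gb_body = u'<a href="/news">\u65b0\u95fb</a>'.encode('gb2312')

utf8_body = gb_body.decode('gb2312').encode('utf-8')

# Both byte strings spell the same text in different encodings.
print(utf8_body.decode('utf-8') == gb_body.decode('gb2312'))  # True
```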

Answer 1 (score: 0)

w3lib.url.canonicalize_url does not work correctly for this webpage, and the workaround above,

resp = response.replace(encoding='utf8', body=response.text.encode('utf8'))

only works in the scrapy shell.

So in the spider we can instead specify canonicalize=True, like this:

LinkExtractor(canonicalize=True)

But note that for the general case, the scrapy documentation says:

If you are using LinkExtractor to follow links, it is more robust to keep the default canonicalize=False.