scrapy: handling URLs

Date: 2017-10-20 00:23:26

Tags: python-3.x web-scraping scrapy

I am crawling an XML sitemap that contains special characters such as é, and I keep getting errors like:

ERROR: Spider error processing <GET [URL with '%C3%A9' instead of 'é']>

How can I make Scrapy keep the original URL as-is, i.e. with the special characters?

Scrapy == 1.3.3

Python == 3.5.2 (I need to stick with these versions)

Update: following a suggestion, I can use unquote to get the URL with the correct characters:

Example usage:

>>> from urllib.parse import unquote
>>> unquote('ros%C3%A9')
'rosé'
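To see the mapping both ways, the standard library's quote re-encodes what unquote decoded. A stdlib-only sketch; the URL below is made up for illustration:

```python
from urllib.parse import quote, unquote

encoded = "http://example.com/ros%C3%A9"  # hypothetical stored URL
decoded = unquote(encoded)                # -> "http://example.com/rosé"
print(decoded)

# Re-encoding (leaving ':' and '/' untouched) restores the stored form.
assert quote(decoded, safe=":/") == encoded
```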

I also tried my own Request subclass without safe_url_string, but I ended up with:

UnicodeEncodeError: 'ascii' codec can't encode character '\xf9' in position 25: ordinal not in range(128)

Full traceback:

[scrapy.core.scraper] ERROR: Error downloading <GET [URL with characters like ù]>
Traceback (most recent call last):
  File "/usr/share/anaconda3/lib/python3.5/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
    result = result.throwExceptionIntoGenerator(g)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/twisted/python/failure.py", line 393, in throwExceptionIntoGenerator
    return g.throw(self.type, self.value, self.tb)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
    defer.returnValue((yield download_func(request=request, spider=spider)))
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/utils/defer.py", line 45, in mustbe_deferred
    result = f(*args, **kw)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/__init__.py", line 65, in download_request
    return handler.download_request(request, spider)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 61, in download_request
    return agent.download_request(request)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 260, in download_request
    agent = self._get_agent(request, timeout)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/handlers/http11.py", line 241, in _get_agent
    scheme = _parse(request.url)[0]
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/webclient.py", line 37, in _parse
    return _parsed_url_args(parsed)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/webclient.py", line 19, in _parsed_url_args
    path = b(path)
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/core/downloader/webclient.py", line 17, in <lambda>
    b = lambda s: to_bytes(s, encoding='ascii')
  File "/usr/share/anaconda3/lib/python3.5/site-packages/scrapy/utils/python.py", line 120, in to_bytes
    return text.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode character '\xf9' in position 25: ordinal not in range(128)
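The last frames show the root cause: the downloader's webclient encodes the request path to ASCII bytes, which cannot represent characters like ù. A minimal stdlib reproduction (the path here is made up):

```python
# Reproducing the failure above: encoding a non-ASCII URL path to
# ASCII bytes, as the to_bytes(s, encoding='ascii') frame does.
path = "/menu-\u00f9"  # hypothetical path containing ù
try:
    path.encode("ascii")
except UnicodeEncodeError as exc:
    print(exc)  # 'ascii' codec can't encode character '\xf9' ...
```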

Any hints?

2 answers:

Answer 0 (score: 1)

I don't think you can avoid this: before storing the URL, Scrapy's Request runs it through safe_url_string from the w3lib library. You would have to reverse that somehow.
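Building on that: since the transport layer needs the percent-encoded form anyway, one workaround is to leave request URLs alone and reverse the encoding only when items are exported. A minimal sketch of an item pipeline; the class name and the 'url' field are assumptions, not part of the original question:

```python
from urllib.parse import unquote


class DecodeUrlPipeline:  # hypothetical pipeline name
    """Reverse the percent-encoding of the 'url' field before export."""

    def process_item(self, item, spider):
        if "url" in item:
            # Turn e.g. 'ros%C3%A9' back into 'rosé' for storage.
            item["url"] = unquote(item["url"])
        return item
```

Enabling it via ITEM_PIPELINES in settings.py would keep downloads working on the ASCII-safe URL while the stored items carry the readable one.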

Answer 1 (score: 0)

You can use the letter 'r' before the URL: url = r'name of that url'