Error saving crawled pages with file_urls and ITEM_PIPELINES: Missing scheme in request url: h

Date: 2016-06-20 14:49:13

Tags: scrapy

I'm trying to get Scrapy to save a copy of every page it crawls, but when I run my spider the log contains entries like the following:

2016-06-20 15:39:12 [scrapy] ERROR: Error processing {'file_urls': 'http://example.com/page',
 'title': u'PageTitle'}
Traceback (most recent call last):
  File "c:\anaconda3\envs\scrapy\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "c:\anaconda3\envs\scrapy\lib\site-packages\scrapy\pipelines\media.py", line 44, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "c:\anaconda3\envs\scrapy\lib\site-packages\scrapy\pipelines\files.py", line 365, in get_media_requests
    return [Request(x) for x in item.get(self.files_urls_field, [])]
  File "c:\anaconda3\envs\scrapy\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "c:\anaconda3\envs\scrapy\lib\site-packages\scrapy\http\request\__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: h
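Reading the traceback, get_media_requests builds one Request per element of the file_urls field; if that field holds a plain string rather than a list, iterating it yields individual characters, which would be consistent with the single "h" in the error. A quick illustration of that iteration behavior (plain Python, not Scrapy-specific):

    # Iterating a string yields single characters,
    # while iterating a list yields whole URLs.
    url_as_string = 'http://example.com/page'
    print([x for x in url_as_string][:4])   # ['h', 't', 't', 'p']

    url_as_list = ['http://example.com/page']
    print([x for x in url_as_list])         # ['http://example.com/page']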

Other questions about this error seem to involve problems with start_urls, but my start URLs are fine, since the spider crawls the whole site; it just doesn't save the pages to the FILES_STORE I specified.
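For reference, the setup described above would typically look something like this in settings.py; the store path is a placeholder, not my actual value:

    # settings.py -- assumed configuration; FILES_STORE path is a placeholder
    ITEM_PIPELINES = {
        'scrapy.pipelines.files.FilesPipeline': 1,
    }
    FILES_STORE = 'C:/scraped_pages'  # local directory where downloaded files should be saved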

I populate file_urls with

item['file_urls'] = response.url
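For context, here is roughly how that assignment sits in my spider; the spider name, start URL, and item class are simplified stand-ins, and the item declares the files field that FilesPipeline fills in alongside file_urls:

    import scrapy


    class PageItem(scrapy.Item):
        title = scrapy.Field()
        file_urls = scrapy.Field()  # URLs for FilesPipeline to download
        files = scrapy.Field()      # populated by the pipeline with download results


    class PageSpider(scrapy.Spider):
        name = 'pages'
        start_urls = ['http://example.com/']

        def parse(self, response):
            item = PageItem()
            item['title'] = response.xpath('//title/text()').extract_first()
            item['file_urls'] = response.url  # assigned as a single string, as described above
            yield item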

Do I need to specify the URL differently?

0 Answers:

No answers yet.