Getting ValueError: Missing scheme in request url in Scrapy redirect

Asked: 2016-11-17 00:31:43

Tags: python url redirect scrapy url-redirection

When I run my Scrapy spider against Amazon, I get a "Missing scheme" error on a redirected URL. How can I make sure every redirect URL starts with http?

2016-11-17 07:16:22 [scrapy] ERROR: Spider error processing <GET https://www.amazon.com/b/ref=lp_3610851_ln_1?node=3752871&ie=UTF8&qid=1479333096> (referer: None)
Traceback (most recent call last):
  File "D:\Kerja\HIT\PYTHON~1\<project_name>\<project_name>\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "D:\Kerja\HIT\PYTHON~1\<project_name>\<project_name>\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "D:\Kerja\HIT\PYTHON~1\<project_name>\<project_name>\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "D:\Kerja\HIT\PYTHON~1\<project_name>\<project_name>\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\Kerja\HIT\PYTHON~1\<project_name>\<project_name>\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "D:\Kerja\HIT\Python Projects\<project_name>\<project_name>\<project_name>\<project_name>\spiders\amazon.py", line 133, in parse
    yield scrapy.Request(url, callback=self.parse_items, meta=response.meta)
  File "D:\Kerja\HIT\PYTHON~1\<project_name>\<project_name>\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "D:\Kerja\HIT\PYTHON~1\<project_name>\<project_name>\lib\site-packages\scrapy\http\request\__init__.py", line 57, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: /gp/slredirect/picassoRedirect.html/ref=pa_sp_btf_browse_lawngarden_sr_pg1_1?ie=UTF8&adId=A00535191JPLEGR67F8IR&url=https%3A%2F%2Fwww.amazon.com%2FRock-Solid-Supplement-Flowering-Hydroponic%2Fdp%2FB00YBHBKP2%2Fref%3Dlp_3752871_1_25%2F161-3753912-4487915%3Fs%3Dlawn-garden%26ie%3DUTF8%26qid%3D1479341778%26sr%3D1-25-spons%26psc%3D1&qualifier=1479341778&id=6512557339213691&widgetName=sp_btf_browse

Update

I looked at the base redirect middleware in Scrapy and found that it already contains this:

    location = safe_url_string(response.headers['location'])

    redirected_url = urljoin(request.url, location)

So logically it should already fix up the redirect URL. Why do I still get a broken redirect URL?

Update

I am already using urljoin in my yield:

def parse(self, response):
    for url in response.xpath(
        '//div[@id="mainResults"]//a[h2/@data-attribute]/@href'
        ).extract():
        yield scrapy.Request(response.urljoin(url), callback=self.parse_items, meta=response.meta)

2 answers:

Answer 0 (score: 4)

I don't think this has anything to do with redirects.

This is where you should look:

  File "D:\Kerja\HIT\Python Projects\<project_name>\<project_name>\<project_name>\<project_name>\spiders\amazon.py", line 133, in parse
    yield scrapy.Request(url, callback=self.parse_items, meta=response.meta)

Your parse callback is yielding scrapy.Request instances with a url that is not complete: it is missing http:// or https:// at the beginning. The scrapy.Request initializer complains about that.
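To illustrate the kind of check that fails here, the following is a minimal sketch using only the standard library. `has_scheme` is a hypothetical helper written for this example, not part of Scrapy's API, and the relative path is a shortened version of the one in the traceback.

```python
from urllib.parse import urlparse


def has_scheme(url):
    """Return True if the URL starts with a scheme such as http or https."""
    return bool(urlparse(url).scheme)


# An absolute URL carries a scheme; a site-relative path does not.
absolute = "https://www.amazon.com/b/ref=lp_3610851_ln_1?node=3752871"
relative = "/gp/slredirect/picassoRedirect.html/ref=pa_sp_btf_browse?ie=UTF8"

print(has_scheme(absolute))
print(has_scheme(relative))
```

Any href for which such a check is false needs to be resolved against the page URL before it is passed to scrapy.Request.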

Scrapy Response objects have a helper method called urljoin() that builds a full absolute URL from a relative location such as /gp/slredirect/picassoRedirect.html...; you call it as response.urljoin(url):

    def parse(self, response):
        ...
        yield scrapy.Request(response.urljoin(url), callback=self.parse_items, meta=response.meta)

So I suggest you change that part of your code to the form shown above.
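As a rough sketch of what that resolution does, the standard library's `urljoin` can be used outside Scrapy. This assumes the page has no `<base>` tag that would change the base URL, and the relative href below is a shortened, hypothetical version of the one in the traceback.

```python
from urllib.parse import urljoin

# URL of the page being parsed (taken from the log in the question).
page_url = "https://www.amazon.com/b/ref=lp_3610851_ln_1?node=3752871&ie=UTF8&qid=1479333096"

# Shortened, hypothetical relative href like the one that triggered the error.
relative_href = "/gp/slredirect/picassoRedirect.html/ref=pa_sp_btf_browse"

# Resolving against the page URL supplies the missing scheme and host.
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)
# -> https://www.amazon.com/gp/slredirect/picassoRedirect.html/ref=pa_sp_btf_browse
```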

Answer 1 (score: 2)

Try something like this:

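from scrapy import Request

for url in urls:  # assumes each url is a site-relative href string
    # Prepend the site root so the request URL has a scheme
    yield Request("http://www.amazon.com" + url, callback=self.whateverfunction)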
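One caveat with plain concatenation is that it mangles hrefs that are already absolute. A hedged alternative, sketched below with the standard library, resolves every href against the page URL instead. `to_absolute` and the sample hrefs are hypothetical names and values made up for illustration.

```python
from urllib.parse import urljoin


def to_absolute(base_url, hrefs):
    """Resolve each href against base_url; absolute hrefs pass through unchanged."""
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical mix of a relative and an already-absolute href.
hrefs = ["/dp/B00YBHBKP2", "https://www.amazon.com/gp/help"]
resolved = to_absolute("https://www.amazon.com/b/ref=lp_3610851_ln_1", hrefs)
print(resolved)
```

With concatenation, the second href would become "http://www.amazon.comhttps://www.amazon.com/gp/help", whereas urljoin leaves it as it is.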