When I run my Scrapy spider against Amazon, I get a "Missing scheme" error for a redirected URL. How can I make sure every redirected URL has http?
2016-11-17 07:16:22 [scrapy] ERROR: Spider error processing <GET https://www.amazon.com/b/ref=lp_3610851_ln_1?node=3752871&ie=UTF8&qid=1479333096> (referer: None)
Traceback (most recent call last):
File "D:\Kerja\HIT\PYTHON~1\<project_name>\<project_name>\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "D:\Kerja\HIT\PYTHON~1\<project_name>\<project_name>\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "D:\Kerja\HIT\PYTHON~1\<project_name>\<project_name>\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 22, in <genexpr>
return (_set_referer(r) for r in result or ())
File "D:\Kerja\HIT\PYTHON~1\<project_name>\<project_name>\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "D:\Kerja\HIT\PYTHON~1\<project_name>\<project_name>\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "D:\Kerja\HIT\Python Projects\<project_name>\<project_name>\<project_name>\<project_name>\spiders\amazon.py", line 133, in parse
yield scrapy.Request(url, callback=self.parse_items, meta=response.meta)
File "D:\Kerja\HIT\PYTHON~1\<project_name>\<project_name>\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
self._set_url(url)
File "D:\Kerja\HIT\PYTHON~1\<project_name>\<project_name>\lib\site-packages\scrapy\http\request\__init__.py", line 57, in _set_url
raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: /gp/slredirect/picassoRedirect.html/ref=pa_sp_btf_browse_lawngarden_sr_pg1_1?ie=UTF8&adId=A00535191JPLEGR67F8IR&url=https%3A%2F%2Fwww.amazon.com%2FRock-Solid-Supplement-Flowering-Hydroponic%2Fdp%2FB00YBHBKP2%2Fref%3Dlp_3752871_1_25%2F161-3753912-4487915%3Fs%3Dlawn-garden%26ie%3DUTF8%26qid%3D1479341778%26sr%3D1-25-spons%26psc%3D1&qualifier=1479341778&id=6512557339213691&widgetName=sp_btf_browse
Update
I reviewed the base redirect middleware in Scrapy and found that it already contains this:
location = safe_url_string(response.headers['location'])
redirected_url = urljoin(request.url, location)
So logically it should already fix up the redirected URL. Why am I still getting a broken redirected URL?
Update
I am already using urljoin in my yield.
    def parse(self, response):
        for url in response.xpath(
            '//div[@id="mainResults"]//a[h2/@data-attribute]/@href'
        ).extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse_items, meta=response.meta)
Answer 0 (score: 4)
I don't think this has anything to do with redirects.
This is where you should be looking:
File "D:\Kerja\HIT\Python Projects\<project_name>\<project_name>\<project_name>\<project_name>\spiders\amazon.py", line 133, in parse
yield scrapy.Request(url, callback=self.parse_items, meta=response.meta)
Your parse callback is yielding scrapy.Request instances whose url is not complete: it is missing http:// or https:// at the beginning, and the scrapy.Request initializer complains about it.
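To see why such a value is rejected, the condition can be approximated outside Scrapy with the standard library. This is only a rough sketch and not Scrapy's own check; has_scheme is a hypothetical helper, and the relative path is a shortened copy of the one in the traceback.

```python
from urllib.parse import urlparse


def has_scheme(url):
    """Return True when the URL carries a scheme such as http or https."""
    return bool(urlparse(url).scheme)


# Shortened version of the relative href from the traceback (assumption: trimmed for readability).
relative = "/gp/slredirect/picassoRedirect.html/ref=pa_sp_btf_browse_lawngarden_sr_pg1_1?ie=UTF8"
# The same path with the site's scheme and host in front of it.
absolute = "https://www.amazon.com" + relative

print(has_scheme(relative))
print(has_scheme(absolute))
```

The relative href has no scheme at all, which matches the "Missing scheme in request url" message raised at line 133 of the spider.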
Scrapy Response objects have a helper method called .urljoin() that builds a full absolute URL from a relative location such as /gp/slredirect/picassoRedirect.html...:
    response.urljoin(url)
So I suggest changing this part of your code to:
    def parse(self, response):
        ...
        yield scrapy.Request(response.urljoin(url), callback=self.parse_items, meta=response.meta)
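The effect of resolving a relative href can be checked without running a crawl. The sketch below uses the standard library's urljoin as a stand-in for response.urljoin (which, for HTML responses, also takes a base tag into account); page_url and href are shortened copies of the URLs in the traceback and are assumptions for illustration.

```python
from urllib.parse import urljoin

# URL of the page being parsed (from the traceback, query string shortened).
page_url = "https://www.amazon.com/b/ref=lp_3610851_ln_1?node=3752871"

# Relative href like the one that triggered the error (shortened).
href = "/gp/slredirect/picassoRedirect.html/ref=pa_sp_btf_browse_lawngarden_sr_pg1_1?ie=UTF8"

# A leading "/" replaces the whole path of page_url; scheme and host are kept.
absolute = urljoin(page_url, href)
print(absolute)
```

The result starts with https://www.amazon.com/gp/slredirect/, so scrapy.Request would no longer see a URL without a scheme.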
Answer 1 (score: 2)
Try something like this:
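    from scrapy import Request

    # Inside a spider callback; assumes each item in urls is a list whose first element is a relative href.
    for url in urls:
        myRequest = Request("http://www.amazon.com" + url.pop(0), callback=self.whateverfunction)
        yield myRequest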
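One caveat with plain concatenation is that it only works when every href is relative. The comparison below is a small sketch with hypothetical hrefs (one relative, one already absolute) showing how concatenation and urljoin differ; BASE and the example paths are assumptions, not values from the spider.

```python
from urllib.parse import urljoin

BASE = "http://www.amazon.com"

# Hypothetical hrefs: the first is relative, the second is already absolute.
hrefs = [
    "/gp/help/customer/display.html",
    "https://www.amazon.com/dp/B00YBHBKP2",
]

# Concatenation blindly prefixes the host, even on an absolute URL.
concatenated = [BASE + h for h in hrefs]

# urljoin resolves relative hrefs and leaves absolute ones untouched.
joined = [urljoin(BASE, h) for h in hrefs]

for c, j in zip(concatenated, joined):
    print(c, "|", j)
```

For a page that mixes both kinds of links, urljoin (or response.urljoin inside a callback) is the safer choice.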