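I am trying to scrape a website that requires the Connection: close header to be removed. By default, however, Scrapy adds a Connection: close header to every request, and it cannot be overridden.
So I tried using a custom request to generate requests without the Connection: close header, but I got errors.
Is there a way to use a custom request instead of scrapy.Request, or to subclass scrapy.Request, in order to remove the Connection: close header?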
Scrapy spider:
    import scrapy
    import requests

    class AdidasSpider(scrapy.Spider):
        name = "adidas"

        def start_requests(self):
            url = 'http://www.adidas.com/us/men-shoes'
            headers = {
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
                "Accept-Encoding": "gzip, deflate",
                "Accept-Language": "en-US,en;q=0.9",
                "Host": "www.adidas.com",
                "Upgrade-Insecure-Requests": "1",
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
            }
            yield requests.get(url, headers=headers, hooks={'response': self.parse})

        def parse(r, *args, **kwargs):
            print r
I get a lot of errors:
2018-01-28 14:58:08 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method ?.request_scheduled of <scrapy.spidermiddlewares.referer.RefererMiddleware object at 0x04D30C30>>
Traceback (most recent call last):
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\scrapy\utils\signal.py", line 30, in send_catch_log
*arguments, **named)
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\pydispatch\robustapply.py", line 55, in robustApply
return receiver(*arguments, **named)
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 343, in request_scheduled
redirected_urls = request.meta.get('redirect_urls', [])
AttributeError: 'Response' object has no attribute 'meta'
Unhandled Error
Traceback (most recent call last):
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\scrapy\commands\crawl.py", line 58, in run
self.crawler_process.start()
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\scrapy\crawler.py", line 285, in start
reactor.run(installSignalHandlers=False) # blocking call
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\twisted\internet\base.py", line 1243, in run
self.mainLoop()
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\twisted\internet\base.py", line 1252, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\twisted\internet\base.py", line 878, in runUntilCurrent
call.func(*call.args, **call.kw)
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\scrapy\utils\reactor.py", line 41, in __call__
return self._func(*self._a, **self._kw)
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\scrapy\core\engine.py", line 135, in _next_request
self.crawl(request, spider)
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\scrapy\core\engine.py", line 210, in crawl
self.schedule(request, spider)
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\scrapy\core\engine.py", line 216, in schedule
if not self.slot.scheduler.enqueue_request(request):
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\scrapy\core\scheduler.py", line 54, in enqueue_request
if not request.dont_filter and self.df.request_seen(request):
exceptions.AttributeError: 'Response' object has no attribute 'dont_filter'
2018-01-28 14:58:08 [twisted] CRITICAL: Unhandled Error
Traceback (most recent call last):
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\scrapy\commands\crawl.py", line 58, in run
self.crawler_process.start()
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\scrapy\crawler.py", line 285, in start
reactor.run(installSignalHandlers=False) # blocking call
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\twisted\internet\base.py", line 1243, in run
self.mainLoop()
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\twisted\internet\base.py", line 1252, in mainLoop
self.runUntilCurrent()
--- <exception caught here> ---
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\twisted\internet\base.py", line 878, in runUntilCurrent
call.func(*call.args, **call.kw)
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\scrapy\utils\reactor.py", line 41, in __call__
return self._func(*self._a, **self._kw)
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\scrapy\core\engine.py", line 135, in _next_request
self.crawl(request, spider)
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\scrapy\core\engine.py", line 210, in crawl
self.schedule(request, spider)
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\scrapy\core\engine.py", line 216, in schedule
if not self.slot.scheduler.enqueue_request(request):
File "d:\work\freelance\shopify_monitor\venv\lib\site-packages\scrapy\core\scheduler.py", line 54, in enqueue_request
if not request.dont_filter and self.df.request_seen(request):
exceptions.AttributeError: 'Response' object has no attribute 'dont_filter'
Answer 0 (score: 0)
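Use

    yield scrapy.Request(url, headers=headers, callback=self.parse)

instead. requests.get sends the request itself and returns a Response object, and in your code that Response is what gets yielded from start_requests. Scrapy expects start_requests to yield Request objects, which is why the errors complain about attributes such as meta and dont_filter missing on a 'Response' object.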