I have a list of data objects, each of which contains a URL to be scraped. Some of the URLs are not valid, but I still want the data objects to pass through the item pipelines.

Following @tomáš-linhart's reply, I understand that using a middleware won't work in this case, because Scrapy doesn't let me create the request object in the first place.

The alternative approach is to yield an item instead of a request if the URL is not valid. Here is my code:
def start_requests(self):
    rurls = json.load(open(self.data_file))
    for data in rurls[:100]:
        url = data['Website'] or ''
        rid = data['id']
        # skip creating requests for invalid urls
        if not (url and validators.url(url)):
            yield self.create_item(rid, url)
            continue
        # create request object
        request_object = scrapy.Request(url=url, callback=self.parse, errback=self.errback_httpbin)
        # populate request object
        request_object.meta['rid'] = rid
        self.logger.info('REQUEST QUEUED for RID: %s', rid)
        yield request_object
The code above throws the error shown below. Beyond the error itself, I'm not sure how to trace where the problem originates. :(
2017-09-22 12:44:38 [scrapy.utils.signal] ERROR: Error caught on signal handler: <bound method RefererMiddleware.request_scheduled of <scrapy.spidermiddlewares.referer.RefererMiddleware object at 0x10f603ef0>>
Traceback (most recent call last):
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/utils/signal.py", line 30, in send_catch_log
    *arguments, **named)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/spidermiddlewares/referer.py", line 343, in request_scheduled
    redirected_urls = request.meta.get('redirect_urls', [])
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/item.py", line 74, in __getattr__
    raise AttributeError(name)
AttributeError: meta
Unhandled Error
Traceback (most recent call last):
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/commands/crawl.py", line 58, in run
    self.crawler_process.start()
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/crawler.py", line 285, in start
    reactor.run(installSignalHandlers=False)  # blocking call
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/twisted/internet/base.py", line 1243, in run
    self.mainLoop()
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/twisted/internet/base.py", line 1252, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/twisted/internet/base.py", line 878, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/utils/reactor.py", line 41, in __call__
    return self._func(*self._a, **self._kw)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/core/engine.py", line 135, in _next_request
    self.crawl(request, spider)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/core/engine.py", line 210, in crawl
    self.schedule(request, spider)
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/core/engine.py", line 216, in schedule
    if not self.slot.scheduler.enqueue_request(request):
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/core/scheduler.py", line 54, in enqueue_request
    if not request.dont_filter and self.df.request_seen(request):
  File "/myhome/.virtualenvs/myproj/lib/python3.5/site-packages/scrapy/item.py", line 74, in __getattr__
    raise AttributeError(name)
builtins.AttributeError: dont_filter
Answer 0 (score: 0)
You can't achieve your goal with the current approach, because the error is raised in the constructor of Request; see the code.
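To illustrate, constructing a Request from a malformed URL raises immediately, before any middleware ever sees it (a minimal sketch; the exact message is from Scrapy 1.x and may differ in other versions):

import scrapy

scrapy.Request(url='not-a-valid-url')
# Raises: ValueError: Missing scheme in request url: not-a-valid-url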
Anyway, I don't understand why you even want to do this. Based on your requirement:

I have a list of data objects, each of which contains a URL to be scraped. Some of the URLs are not valid, but I still want the data objects to pass through the item pipelines.

If I understand you correctly, you already have a complete item (a data object in your terminology) and you just want it to pass through the item pipelines. Then do the URL validation in the spider and, if the URL is not valid, simply yield the item instead of yielding a request for the URL it contains. No spider middleware is needed.
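As the next answer notes, items can only be yielded from a callback, not from start_requests itself. One way to apply this advice despite that restriction is to drive the loop from the callback of a single bootstrap request. This is only a sketch: the spider name, parse_records, and the http://example.com placeholder (any reachable URL would do) are assumptions, while create_item, validators, and data_file come from the question.

import json

import scrapy
import validators


class WebsiteSpider(scrapy.Spider):
    name = 'websites'

    def start_requests(self):
        # Yield exactly one Request (which start_requests allows) and carry
        # the raw records along in its meta dict.
        rurls = json.load(open(self.data_file))
        bootstrap = scrapy.Request(url='http://example.com',
                                   callback=self.parse_records)
        bootstrap.meta['records'] = rurls[:100]
        yield bootstrap

    def parse_records(self, response):
        # Callbacks may yield items as well as requests, so records with
        # invalid URLs can go straight to the item pipelines from here.
        for data in response.meta['records']:
            url = data['Website'] or ''
            rid = data['id']
            if not (url and validators.url(url)):
                yield self.create_item(rid, url)
                continue
            request = scrapy.Request(url=url, callback=self.parse,
                                     errback=self.errback_httpbin)
            request.meta['rid'] = rid
            yield request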
Answer 1 (score: 0)
You can't yield Item objects from the start_requests method, only Request objects.
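If the flow has to stay inside start_requests, one possible workaround is to wrap each invalid record in a request that is guaranteed to fail fast and emit the item from its errback, since errbacks may yield items just as callbacks do. This is a sketch, not the answer's own method: it assumes the reserved .invalid TLD never resolves, uses the dont_retry meta key to skip retry attempts, and the errback_invalid name and record.invalid placeholder are hypothetical; create_item comes from the question.

def start_requests(self):
    rurls = json.load(open(self.data_file))
    for data in rurls[:100]:
        url = data['Website'] or ''
        rid = data['id']
        if url and validators.url(url):
            request = scrapy.Request(url=url, callback=self.parse,
                                     errback=self.errback_httpbin)
        else:
            # The .invalid TLD is reserved, so DNS resolution fails fast and
            # the errback fires; dont_filter stops the duplicate filter from
            # dropping the repeated placeholder URL.
            request = scrapy.Request(url='http://record.invalid/',
                                     errback=self.errback_invalid,
                                     dont_filter=True)
            request.meta['dont_retry'] = True  # skip RetryMiddleware retries
        request.meta['rid'] = rid
        request.meta['orig_url'] = url
        yield request

def errback_invalid(self, failure):
    # Emit the item for the record whose URL failed validation, so it still
    # passes through the item pipelines.
    meta = failure.request.meta
    yield self.create_item(meta['rid'], meta['orig_url'])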