I have a spider that has to use Selenium to scrape dynamic data from the page. Here is what it looks like:
import scrapy
from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher
from selenium import webdriver

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.org']

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(5)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # Shut down the Selenium browser when the spider finishes.
        if self.driver:
            self.driver.quit()
            self.driver = None
The problem is that when I cancel the job in Scrapyd, it does not stop until I close the browser window manually. Once the spider is deployed to a real server, I obviously cannot do that.
This is what I see in the Scrapyd log every time I click "Cancel":
2015-08-12 13:48:13+0300 [HTTPChannel,208,127.0.0.1] Unhandled Error
Traceback (most recent call last):
  File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/twisted/web/http.py", line 1731, in allContentReceived
    req.requestReceived(command, path, version)
  File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/twisted/web/http.py", line 827, in requestReceived
    self.process()
  File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/twisted/web/server.py", line 189, in process
    self.render(resrc)
  File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/twisted/web/server.py", line 238, in render
    body = resrc.render(self)
--- <exception caught here> ---
  File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/scrapyd/webservice.py", line 18, in render
    return JsonResource.render(self, txrequest)
  File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/scrapy/utils/txweb.py", line 10, in render
    r = resource.Resource.render(self, txrequest)
  File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/twisted/web/resource.py", line 250, in render
    return m(request)
  File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/scrapyd/webservice.py", line 55, in render_POST
    s.transport.signalProcess(signal)
  File "/home/dmitry/.virtualenvs/myproject/local/lib/python2.7/site-packages/twisted/internet/process.py", line 339, in signalProcess
    raise ProcessExitedAlready()
twisted.internet.error.ProcessExitedAlready:
But the job stays in the job list, marked as "Running". So how do I get the driver closed?
Answer 0 (score: 0)
Import SignalManager:
from scrapy.signalmanager import SignalManager
Then replace:
dispatcher.connect(self.spider_closed, signals.spider_closed)
with:
SignalManager(dispatcher.Any).connect(self.spider_closed, signal=signals.spider_closed)
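Put together in the spider's __init__, the change might look like this (a minimal sketch reusing MySpider and webdriver from the question; the import locations are assumptions based on where these names lived in Scrapy at the time, with dispatcher.Any coming from pydispatch):

from scrapy import signals
from scrapy.signalmanager import SignalManager
from scrapy.xlib.pydispatch import dispatcher

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(5)
        # Register the handler through a SignalManager instead of
        # calling pydispatch's dispatcher.connect directly.
        SignalManager(dispatcher.Any).connect(
            self.spider_closed, signal=signals.spider_closed)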
Answer 1 (score: 0)
Have you tried implementing from_crawler on your spider? I have only done this for pipelines and extensions, but it should work the same way for spiders.
http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.from_crawler
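A minimal sketch of that pattern applied to the spider from the question (the from_crawler override and the crawler.signals.connect call follow the documentation linked above; everything else is carried over from the question):

import scrapy
from scrapy import signals
from selenium import webdriver

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.org']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        # Connect through the crawler's own signal manager so Scrapy
        # delivers spider_closed to this spider when the job ends.
        crawler.signals.connect(spider.spider_closed,
                                signal=signals.spider_closed)
        return spider

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Firefox()
        self.driver.implicitly_wait(5)

    def spider_closed(self, spider):
        # Quit the browser so no orphaned Firefox process is left behind.
        if self.driver:
            self.driver.quit()
            self.driver = None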