I have a list of URLs that I want to crawl. Note that start_urls is not the behaviour I want: I want to crawl them one by one, each in a separate crawl session. The code below is a complete, broken, copy-pasteable example. Basically, it tries to iterate over the list of URLs and start a crawler for each one. It is based on the Common Practices documentation.
from urllib.parse import urlparse
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.spiders import CrawlSpider
class MySpider(CrawlSpider):
    name = 'my-spider'

    def __init__(self, start_url, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [start_url]
        self.allowed_domains = [urlparse(start_url).netloc]
urls = [
    'http://testphp.vulnweb.com/',
    'http://testasp.vulnweb.com/'
]
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
for url in urls:
    runner.crawl(MySpider, url)

reactor.run()
The problem with the above is that it hangs after the first URL; the second URL is never crawled, and after that nothing more happens:
2018-08-13 20:28:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://testphp.vulnweb.com/> (referer: None)
[...]
2018-08-13 20:28:44 [scrapy.core.engine] INFO: Spider closed (finished)
Answer 0 (score: 2)
reactor.run() blocks your loop forever, right from the start. The only way around this is to play by the Twisted rules. One way of doing so is to replace your loop with a Twisted-specific asynchronous loop, like this:
from twisted.internet.defer import inlineCallbacks  # new import

...

@inlineCallbacks
def loop_urls(urls):
    for url in urls:
        # yield waits for the crawl's Deferred to fire before starting the next URL
        yield runner.crawl(MySpider, url)
    reactor.stop()

loop_urls(urls)
reactor.run()
And this "magic" roughly translates to:
def loop_urls(urls):
    url, *rest = urls
    dfd = runner.crawl(MySpider, url)
    # crawl() returns a deferred to which a callback (or errback) can be attached
    dfd.addCallback(lambda _: loop_urls(rest) if rest else reactor.stop())

loop_urls(urls)
reactor.run()
You could use this expanded form as well, but it's far from pretty.
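For completeness, here is how the two pieces fit together: a minimal sketch that simply reuses the spider and URL list from the question and swaps the plain for loop for the inlineCallbacks version shown above (an illustration of the idea, not a tuned production script):

from urllib.parse import urlparse

from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'my-spider'

    def __init__(self, start_url, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [start_url]
        self.allowed_domains = [urlparse(start_url).netloc]


urls = [
    'http://testphp.vulnweb.com/',
    'http://testasp.vulnweb.com/'
]

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()


@inlineCallbacks
def loop_urls(urls):
    for url in urls:
        # runner.crawl() returns a Deferred that fires when the crawl finishes;
        # yielding it pauses here until then, so the crawls run strictly one at a time
        yield runner.crawl(MySpider, url)
    # all URLs done: shut the event loop down
    reactor.stop()


loop_urls(urls)
reactor.run()

Each call to runner.crawl() starts a fresh crawl with its own spider instance, so the URLs are crawled one after another and the reactor is stopped once the last one has finished.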