I am trying to crawl two domains at the same time. I created a spider like this:
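    # Import paths assume the Scrapy 0.24-era API that the rest of this code uses.
    from scrapy import log
    from scrapy.contrib.linkextractors import LinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule

    class TestSpider(CrawlSpider):
        name = 'test-spider'
        allowed_domains = ['domain-a.com', 'domain-b.com']
        start_urls = ['http://www.domain-a.com/index.html',
                      'http://www.domain-b.com/index.html']
        rules = (
            # Follow every extracted link and log each visited page.
            Rule(LinkExtractor(), follow=True, callback='parse_item'),
        )

        def parse_item(self, response):
            log.msg('parsing ' + response.url, log.DEBUG)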
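I expected to see a mix of domain-a.com and domain-b.com entries in the output, but only domain-a is mentioned in the log. However, if I run separate spiders/crawlers, I do see both domains being scraped at the same time (not the actual code, but it illustrates the point):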
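    from twisted.internet import reactor
    from scrapy import log, signals
    from scrapy.crawler import Crawler
    from scrapy.utils.project import get_project_settings

    def setup_crawler(url):
        spider = TestSpider(start_url=url)
        crawler = Crawler(get_project_settings())
        crawler.configure()
        # Pass the callable itself; reactor.stop() would call it immediately.
        crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
        crawler.crawl(spider)
        crawler.start()

    setup_crawler('http://www.domain-a.com/index.html')
    setup_crawler('http://www.domain-b.com/index.html')
    log.start(loglevel=log.DEBUG)
    reactor.run()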
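To make the "only domain-a shows up" observation easier to verify, the log could be tallied per host. The following is only a rough sketch, not part of my spider: it assumes Python 3, assumes each relevant log line contains the text `parsing <url>` as produced by `parse_item` above, and the helper name `count_hosts` and the sample lines are made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Matches the 'parsing <url>' message written by parse_item (assumed format).
PARSING_RE = re.compile(r"parsing (\S+)")

def count_hosts(log_lines):
    """Count 'parsing <url>' log entries per host name (hypothetical helper)."""
    counts = Counter()
    for line in log_lines:
        match = PARSING_RE.search(line)
        if match:
            counts[urlparse(match.group(1)).hostname] += 1
    return counts

# Made-up sample lines; real Scrapy log lines carry extra prefixes.
sample = [
    "DEBUG: parsing http://www.domain-a.com/index.html",
    "DEBUG: parsing http://www.domain-b.com/index.html",
    "DEBUG: parsing http://www.domain-a.com/about.html",
    "INFO: some unrelated line",
]
print(count_hosts(sample))
```

If the single-spider run really visits only one domain, the tally for the other host would be missing entirely rather than merely smaller.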
Thanks.