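Does a Scrapy spider download from multiple domains simultaneously?

Time: 2014-11-06 15:45:22

Tags: scrapy

I am trying to crawl 2 domains at the same time. I created a spider like this: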

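# Imports assume the Scrapy 0.24-era API that log.msg belongs to.
from scrapy import log
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule


class TestSpider(CrawlSpider):
    name = 'test-spider'
    allowed_domains = ['domain-a.com', 'domain-b.com']
    start_urls = ['http://www.domain-a.com/index.html',
                  'http://www.domain-b.com/index.html']
    # Follow every extracted link and pass each response to parse_item.
    rules = (
        Rule(LinkExtractor(), follow=True, callback='parse_item'),
    )

    def parse_item(self, response):
        # Log each crawled URL so the two domains can be told apart.
        log.msg('parsing ' + response.url, log.DEBUG)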

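I expected to see a mix of "domain-a.com" and "domain-b.com" entries in the output, but I only see domain-a mentioned in the log. However, if I run separate spiders/crawlers, I do see both domains being scraped at the same time (not the actual code, but it illustrates the point):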

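from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings


def setup_crawler(url):
    spider = TestSpider(start_url=url)
    crawler = Crawler(get_project_settings())
    crawler.configure()
    # Connect the callable itself; writing reactor.stop() would call it right here.
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.crawl(spider)
    crawler.start()

# One crawler per start URL, all sharing the same reactor.
setup_crawler('http://www.domain-a.com/index.html')
setup_crawler('http://www.domain-b.com/index.html')
log.start(loglevel=log.DEBUG)
reactor.run()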
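To make "I only see domain-a in the log" measurable, the 'parsing <url>' messages written by parse_item can be tallied by host. The following is a minimal sketch using only the standard library. The helper names (hosts_in_order, count_switches) and the sample lines are hypothetical, and it assumes each relevant log line contains the text 'parsing ' followed by the URL, as in the spider above.

```python
from collections import Counter
from urllib.parse import urlparse

PREFIX = "parsing "  # assumption: same text that parse_item logs before the URL


def hosts_in_order(log_lines):
    """Return the host of every 'parsing <url>' line, in log order."""
    hosts = []
    for line in log_lines:
        # A log line may start with a timestamp or level, so search for the
        # prefix anywhere in the line instead of only at the beginning.
        idx = line.find(PREFIX)
        if idx == -1:
            continue
        url = line[idx + len(PREFIX):].strip()
        host = urlparse(url).hostname
        if host:
            hosts.append(host)
    return hosts


def count_switches(hosts):
    """Count how often two consecutive entries come from different hosts."""
    return sum(1 for a, b in zip(hosts, hosts[1:]) if a != b)


if __name__ == "__main__":
    # Made-up sample input for illustration, not real crawl output.
    sample = [
        "parsing http://www.domain-a.com/index.html",
        "parsing http://www.domain-a.com/page1.html",
        "parsing http://www.domain-b.com/index.html",
    ]
    hosts = hosts_in_order(sample)
    print(Counter(hosts))         # pages seen per host
    print(count_switches(hosts))  # 0 means the hosts never alternate
```

A per-host count of zero for domain-b means it was never parsed at all, while a low switch count with non-zero totals for both hosts means both were crawled but one after the other rather than interleaved.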

Thanks

0 answers:

No answers