如何通过scrapy将2个蜘蛛的所有结果放在一个XML中?

时间:2015-10-14 13:02:01

标签: xml python-2.7 scrapy scrapy-spider

我用Scrapy制作了2个蜘蛛,我需要将它放在一个脚本中,并将所有结果放在一个XML中。

以下是一些管理2个蜘蛛的方法,但我无法将结果整合到一个XML中。

http://doc.scrapy.org/en/latest/topics/practices.html

有没有办法用2个蜘蛛启动1个脚本并将所有结果收集到一个文件中?

1 个答案:

答案 0 :(得分:0)

在你的scrapy项目中创建一个名为script.py的python脚本添加 下面的代码行。假设蜘蛛文件的名称是spider_one.py和spider_two.py,你的蜘蛛分别是SpiderOne和SpiderTwo。 所以你在script.py中添加。

from spiders.spider_one import SpiderOne
from spiders.spider_two import SpiderTwo

# scrapy api
from scrapy import signals, log
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings


file = "your_file.json" #your results
TO_CRAWL = [SpiderOne,SpiderTwo]


# list of crawlers that are running 
RUNNING_CRAWLERS = []

def spider_closing(spider):
    """Activates on spider closed signal"""
    log.msg("Spider closed: %s" % spider, level=log.INFO)
    RUNNING_CRAWLERS.remove(spider)
    if not RUNNING_CRAWLERS:
        reactor.stop()

log.start(loglevel=log.DEBUG)
for spider in TO_CRAWL:
    settings = Settings()
    settings.set("USER_AGENT", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36")
    settings.set("FEED_FORMAT",'json')
    settings.set("FEED_URI",file)
    # settings.set("ITEM_PIPELINES",{ 'pipelines.CustomPipeline': 300})
    settings.set("DOWNLOAD_DELAY",1)
    crawler = Crawler(settings)
    crawler_obj = spider()
    RUNNING_CRAWLERS.append(crawler_obj)

    # stop reactor when spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(crawler_obj)
    crawler.start()

# blocks process so always keep as the last statement
reactor.run()

该示例适用于json,但也适用于xml。