Writing data to multiple sheets in a single CSV file from Python / Scrapy (a Python framework)

Date: 2012-10-18 11:32:06

Tags: python csv scrapy

I am using the Scrapy framework and fetching data from two URLs via two spider files.

For example, when I run spider1 against url1, the scraped data is saved to the csv1 file, and when I run spider2, its data is saved to the csv2 file.

What I actually want is to save the data from the different spiders into different sheets of a single CSV file (each sheet named after its spider).

In short, my question is: how do I write data to multiple sheets in a single CSV file from Python?
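Note that the CSV format itself has no concept of sheets, so a single CSV file cannot hold true worksheets the way an Excel workbook can. One workaround (an assumption on my part, not part of the original question) is to emulate sheets by writing each spider's rows under a named section header in one file. A minimal stdlib sketch:

```python
import csv

def write_sections(path, sections):
    """Emulate 'sheets' inside one CSV file.

    CSV has no real sheet concept, so each section is written as a
    marker row carrying the sheet (spider) name, followed by its rows.
    `sections` maps sheet name -> list of rows.
    """
    with open(path, 'w') as f:
        writer = csv.writer(f)
        for name, rows in sections.items():
            writer.writerow(['# sheet: %s' % name])  # section marker row
            for row in rows:
                writer.writerow(row)

# illustrative data only; the real spiders yield scraped items
write_sections('combined.csv', {
    'browser_statistics': [['year', 'chrome'], [2012, 35]],
    'browser_os': [['year', 'linux'], [2012, 5]],
})
```

A reader of such a file has to split it back into sections on the `# sheet:` marker rows, which is the price of staying within plain CSV.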

pipeline.py

import csv
from datetime import datetime
from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
from scrapy import log

class W3CBrowserPipeline(object):
    def __init__(self):
        dispatcher.connect(self.spider_opened, signal=signals.spider_opened)
        dispatcher.connect(self.spider_closed, signal=signals.spider_closed)
        self.brandCategoryCsv = csv.writer(open('wcbbrowser.csv', 'wb'))

    def spider_opened(self, spider):
        spider.started_on = datetime.now()
        # the three branches were identical, so they collapse into one:
        # each spider gets its own CSV file, named after the spider and the date
        if spider.name in ('browser_statistics', 'browser_os', 'browser_display'):
            log.msg("opened spider %s at time %s" % (spider.name, datetime.now().strftime('%H-%M-%S')))
            self.brandCategoryCsv = csv.writer(
                open("csv/%s-%s.csv" % (spider.name, datetime.now().strftime('%d%m%y')), "wb"),
                delimiter=',', quoting=csv.QUOTE_MINIMAL)

    def process_item(self, item, spider):
        if spider.name == 'browser_statistics':
            self.brandCategoryCsv.writerow([item['year'],
                                            item['internet_explorer'],
                                            item['firefox'],
                                            item['chrome'],
                                            item['safari'],
                                            item['opera'],
            ])
        elif spider.name == 'browser_os':
            # the original code nested a second "def process_item" here,
            # which was defined but never called, so browser_os items were
            # silently dropped; write the row directly instead
            self.brandCategoryCsv.writerow([item['year'],
                                            item['vista'],
                                            item['nt'],
                                            item['winxp'],
                                            item['linux'],
                                            item['mac'],
                                            item['mobile'],
            ])
        return item

    def spider_closed(self, spider):
        log.msg("closed spider %s at %s" % (spider.name,datetime.now().strftime('%H-%M-%S')))
        work_time = datetime.now() - spider.started_on
        print "%s - total time taken by the spider to run" % work_time

1 Answer:

Answer 0 (score: 0)

I don't know whether Scrapy has a nice built-in way to do this from the command line, but writing your own pipeline is quite simple. The pipeline can open the same file for all spiders and write a different table for each one; you have to implement that logic yourself.
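That suggested logic can be sketched without the Scrapy wiring (the class and method names below are illustrative, not Scrapy API; a real pipeline would hook this into the `spider_opened`/`spider_closed` signals as the question's code does): keep one shared writer, and emit a section header the first time each spider writes.

```python
import csv

class MultiSectionCsvPipeline(object):
    """Share one CSV file among several spiders, one section per spider.

    Hypothetical sketch: since CSV has no sheets, each spider's rows are
    grouped under a marker row carrying the spider's name.
    """
    def __init__(self, path):
        self.file = open(path, 'w')
        self.writer = csv.writer(self.file)
        self.seen = set()  # spider names whose section header is written

    def process_item(self, row, spider_name):
        if spider_name not in self.seen:
            self.writer.writerow(['# sheet: %s' % spider_name])
            self.seen.add(spider_name)
        self.writer.writerow(row)
        return row

    def close(self):
        self.file.close()

# usage sketch with made-up rows
pipeline = MultiSectionCsvPipeline('wcbbrowser.csv')
pipeline.process_item([2012, 35], 'browser_statistics')
pipeline.process_item([2012, 5], 'browser_os')
pipeline.close()
```

If real worksheets are required, a workbook format such as .xlsx (via a library like openpyxl) is the more natural fit than a single CSV, since that format actually supports multiple named sheets.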