How to specify which file Scrapy should export data to, based on the scraped results

Time: 2017-03-06 21:59:41

Tags: python scrapy scrapy-spider

When I start my Scrapy spider, how can I create a set of files:

year1.csv
year2.csv
year3.csv

If a file already exists and has content in it, it should also be cleared.

Then, during parsing, export to the appropriate file depending on the scraped result:

def parse(self, response):
    if response.css('#Contact1'):
        yield {
            'Name': response.css('#ContactName1 a::text').extract_first()
        }

    if response.css('#Contact1').extract_first() == "1":
        pass  # export to year1.csv
    if response.css('#Contact1').extract_first() == "2":
        pass  # export to year2.csv
    if response.css('#Contact1').extract_first() == "3":
        pass  # export to year3.csv

1 Answer:

Answer 0 (score: 0)

You can use an item pipeline to do this. Here is the official documentation: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

Here is how I would go about it: I would create a separate item class for each output file.

items.py

import scrapy

class Year1Item(scrapy.Item):
    name = scrapy.Field()

class Year2Item(scrapy.Item):
    name = scrapy.Field()

class Year3Item(scrapy.Item):
    name = scrapy.Field()

Then in your spider file you can do this:

from myproject.items import Year1Item, Year2Item, Year3Item  # adjust 'myproject' to your project's package name

def parse(self, response):
    if response.css('#Contact1'):
        if response.css('#Contact1').extract_first() == "1":
            item = Year1Item()
        elif response.css('#Contact1').extract_first() == "2":
            item = Year2Item()
        elif response.css('#Contact1').extract_first() == "3":
            item = Year3Item()
        item['name'] = response.css('#ContactName1 a::text').extract_first()
        return item

Then in your pipelines.py file:

def process_item(self, item, spider):
    if isinstance(item, Year1Item):
        pass  # export to year1.csv
    if isinstance(item, Year2Item):
        pass  # export to year2.csv
    if isinstance(item, Year3Item):
        pass  # export to year3.csv
    return item

In your pipeline file you can also have a method that runs when the spider is opened:

def open_spider(self, spider):
    # maybe here you could use Python to check if the files already exist and delete them if they do
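Putting it together, here is a minimal sketch of such a pipeline using Scrapy's built-in CsvItemExporter. The class name YearCsvExportPipeline and the myproject import path are placeholders of mine, not something from your project; opening the files in 'wb' mode in open_spider truncates them, which also takes care of clearing any existing contents:

# pipelines.py -- rough sketch; adjust the placeholder names to your project
from scrapy.exporters import CsvItemExporter

from myproject.items import Year1Item, Year2Item, Year3Item


class YearCsvExportPipeline(object):

    def open_spider(self, spider):
        # 'wb' truncates each file on open, so old contents are cleared
        # every time the spider starts.
        self.files = {
            Year1Item: open('year1.csv', 'wb'),
            Year2Item: open('year2.csv', 'wb'),
            Year3Item: open('year3.csv', 'wb'),
        }
        self.exporters = {cls: CsvItemExporter(f) for cls, f in self.files.items()}
        for exporter in self.exporters.values():
            exporter.start_exporting()

    def process_item(self, item, spider):
        # Route the item to the exporter that matches its class.
        exporter = self.exporters.get(type(item))
        if exporter is not None:
            exporter.export_item(item)
        return item

    def close_spider(self, spider):
        for exporter in self.exporters.values():
            exporter.finish_exporting()
        for f in self.files.values():
            f.close()

You would also need to enable the pipeline in settings.py, for example (again assuming the project module is called myproject):

ITEM_PIPELINES = {
    'myproject.pipelines.YearCsvExportPipeline': 300,
}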