Scrapy from a script. Won't export the data

Date: 2014-12-19 20:14:44

Tags: python-2.7 web web-scraping scrapy twisted.internet

I am trying to run Scrapy from a script, and I cannot get the program to create the export files.

I have tried to export the data in two different ways:

  1. With an item pipeline
  2. With a feed exporter

Both methods work when I run Scrapy from the command line, but neither works when I run Scrapy from a script.

I am not the only one with this problem. Here are two other similar, unanswered questions, which I had not noticed before posting mine:

  1. JSON not working in scrapy when calling spider through a python script?
  2. Calling scrapy from a python script not creating JSON output file

Here is the code that runs Scrapy from my script. It includes the settings for writing the output file with both the pipelines and the feed exporter:

      from twisted.internet import reactor
      
      from scrapy import log, signals
      from scrapy.crawler import Crawler
      from scrapy.xlib.pydispatch import dispatcher
      import logging
      
      from external_links.spiders.test import MySpider
      from scrapy.utils.project import get_project_settings
      settings = get_project_settings()
      
      #manually set settings here
      settings.set('ITEM_PIPELINES',{'external_links.pipelines.FilterPipeline':100,'external_links.pipelines.CsvWriterPipeline':200},priority='cmdline')
      settings.set('DEPTH_LIMIT',1,priority='cmdline')
      settings.set('LOG_FILE','Log.log',priority='cmdline')
      settings.set('FEED_URI','output.csv',priority='cmdline')
      settings.set('FEED_FORMAT', 'csv',priority='cmdline')
      settings.set('FEED_EXPORTERS',{'csv':'external_links.exporter.CsvOptionRespectingItemExporter'},priority='cmdline')
      settings.set('FEED_STORE_EMPTY',True,priority='cmdline')
      
      def stop_reactor():
          reactor.stop()
      
      dispatcher.connect(stop_reactor, signal=signals.spider_closed)
      spider = MySpider()
      crawler = Crawler(settings)
      crawler.configure()
      crawler.crawl(spider)
      crawler.start()
      log.start(loglevel=logging.DEBUG)
      log.msg('reactor running...')
      reactor.run()
      log.msg('Reactor stopped...')
      

After I run this code, the log says "Stored csv feed (341 items): output.csv", yet no output.csv can be found.
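
One thing worth ruling out first: a relative FEED_URI such as output.csv is resolved against the process's current working directory, which is not necessarily the directory the script lives in. A minimal sanity check (the path handling here is an assumption, not taken from the project):

      import os

      # a relative FEED_URI lands in the current working directory
      print 'cwd:', os.getcwd()

      # an absolute file:// URI rules out a wrong working directory
      settings.set('FEED_URI',
                   'file://' + os.path.join(os.getcwd(), 'output.csv'),
                   priority='cmdline')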

Here is my feed export code:

      settings = get_project_settings()
      
      #manually set settings here
      settings.set('ITEM_PIPELINES',   {'external_links.pipelines.FilterPipeline':100,'external_links.pipelines.CsvWriterPipeline': 200},priority='cmdline')
      settings.set('DEPTH_LIMIT',1,priority='cmdline')
      settings.set('LOG_FILE','Log.log',priority='cmdline')
      settings.set('FEED_URI','output.csv',priority='cmdline')
      settings.set('FEED_FORMAT', 'csv',priority='cmdline')
      settings.set('FEED_EXPORTERS',{'csv':'external_links.exporter.CsvOptionRespectingItemExporter'},priority='cmdline')
      settings.set('FEED_STORE_EMPTY',True,priority='cmdline')
      
      
      from scrapy.contrib.exporter import CsvItemExporter
      
      
      class CsvOptionRespectingItemExporter(CsvItemExporter):
      
          def __init__(self, *args, **kwargs):
              delimiter = settings.get('CSV_DELIMITER', ',')
              kwargs['delimiter'] = delimiter
              super(CsvOptionRespectingItemExporter, self).__init__(*args, **kwargs)
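
The exporter above reads an optional CSV_DELIMITER setting, falling back to a comma. A hypothetical override (the semicolon is only an example value) would look like:

      # optional: choose the delimiter the custom exporter reads at startup
      settings.set('CSV_DELIMITER', ';', priority='cmdline')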
      

Here is my pipeline code:

      import csv
      
      class CsvWriterPipeline(object):
      
          def __init__(self):
              self.csvwriter = csv.writer(open('items2.csv', 'wb'))
      
          def process_item(self, item, spider): #item needs to be second in this list otherwise get spider object
              self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
              return item
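
One detail worth noting in this pipeline: the file handle passed to csv.writer is never closed, so buffered rows may never be flushed to disk. A sketch of the same pipeline using Scrapy's open_spider/close_spider hooks, which close the file cleanly when the spider finishes:

      import csv

      class CsvWriterPipeline(object):

          def open_spider(self, spider):
              # keep a reference to the file object so it can be closed later
              self.csvfile = open('items2.csv', 'wb')
              self.csvwriter = csv.writer(self.csvfile)

          def process_item(self, item, spider):
              self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
              return item

          def close_spider(self, spider):
              # closing the file flushes any rows still sitting in the buffer
              self.csvfile.close()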
      

1 Answer:

Answer (score: 1):

I ran into the same problem.

Here is what worked for me:

  1. Put the export URI into settings.py:

    FEED_URI='file:///tmp/feeds/filename.jsonlines'

  2. Next to scrapy.cfg, create a scrape.py script with the following content (a combined sketch also follows after this answer):

     
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings
    
    
    process = CrawlerProcess(get_project_settings())
    
    process.crawl('yourspidername') #'yourspidername' is the name of one of the spiders of the project.
    process.start() # the script will block here until the crawling is finished
    
    
  3. Run: python scrape.py

  4. Result: the file gets created.

    Note: my project has no pipelines, so I am not sure whether a pipeline would filter your results.

    Also: here is the common pitfalls section of the docs that helped me.
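
Building on this answer, here is a minimal sketch that keeps the question's in-script settings overrides while using CrawlerProcess; the feed path and the spider name are assumptions, not values from the project:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()
    # apply the overrides before the process is created, as in the question
    settings.set('FEED_URI', 'file:///tmp/feeds/output.csv', priority='cmdline')
    settings.set('FEED_FORMAT', 'csv', priority='cmdline')

    process = CrawlerProcess(settings)
    process.crawl('yourspidername')  # the name attribute of your spider class
    process.start()                  # blocks here until the crawl is finished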