I am trying to run Scrapy from a script, and I cannot get the program to create the export file.
I have tried to export the file in two different ways:
Both of them work when I run Scrapy from the command line, but neither works when I run Scrapy from a script.
I am not the only one with this problem; there are two other similar, unanswered questions, which I had not noticed before posting mine.
Here is the code I use to run Scrapy from a script. It includes the settings for writing the output file with both a pipeline and a feed exporter.
from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.xlib.pydispatch import dispatcher
import logging
from external_links.spiders.test import MySpider
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
#manually set settings here
settings.set('ITEM_PIPELINES',{'external_links.pipelines.FilterPipeline':100,'external_links.pipelines.CsvWriterPipeline':200},priority='cmdline')
settings.set('DEPTH_LIMIT',1,priority='cmdline')
settings.set('LOG_FILE','Log.log',priority='cmdline')
settings.set('FEED_URI','output.csv',priority='cmdline')
settings.set('FEED_FORMAT', 'csv',priority='cmdline')
settings.set('FEED_EXPORTERS',{'csv':'external_links.exporter.CsvOptionRespectingItemExporter'},priority='cmdline')
settings.set('FEED_STORE_EMPTY',True,priority='cmdline')
def stop_reactor():
    reactor.stop()
dispatcher.connect(stop_reactor, signal=signals.spider_closed)
spider = MySpider()
crawler = Crawler(settings)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start(loglevel=logging.DEBUG)
log.msg('reactor running...')
reactor.run()
log.msg('Reactor stopped...')
在我运行此代码后,日志显示:“存储csv feed(341项):output.csv”,但没有找到output.csv。
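For reference, a relative FEED_URI such as output.csv is resolved against the process's current working directory, so when the script is launched from somewhere other than the project directory the file may land elsewhere. A quick check of my own (just a debugging sketch, not part of the project):

import os

# show where a relative 'output.csv' would actually be written
print('cwd:', os.getcwd())
print('expected file:', os.path.abspath('output.csv'))
print('exists?', os.path.exists('output.csv'))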
Here is my feed exporter code:
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
#manually set settings here
settings.set('ITEM_PIPELINES', {'external_links.pipelines.FilterPipeline':100,'external_links.pipelines.CsvWriterPipeline': 200},priority='cmdline')
settings.set('DEPTH_LIMIT',1,priority='cmdline')
settings.set('LOG_FILE','Log.log',priority='cmdline')
settings.set('FEED_URI','output.csv',priority='cmdline')
settings.set('FEED_FORMAT', 'csv',priority='cmdline')
settings.set('FEED_EXPORTERS',{'csv':'external_links.exporter.CsvOptionRespectingItemExporter'},priority='cmdline')
settings.set('FEED_STORE_EMPTY',True,priority='cmdline')
from scrapy.contrib.exporter import CsvItemExporter

class CsvOptionRespectingItemExporter(CsvItemExporter):
    def __init__(self, *args, **kwargs):
        delimiter = settings.get('CSV_DELIMITER', ',')
        kwargs['delimiter'] = delimiter
        super(CsvOptionRespectingItemExporter, self).__init__(*args, **kwargs)
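For what it's worth, the only thing this exporter changes is the delimiter, which it reads from the project settings. A minimal illustration of the intended usage (the value here is just an example of mine):

# hypothetical, in settings.py: with the exporter registered under
# FEED_EXPORTERS as above, the CSV delimiter comes from this setting
# (falling back to ',' in the exporter's __init__)
CSV_DELIMITER = ';'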
Here is my pipeline code:
import csv

class CsvWriterPipeline(object):
    def __init__(self):
        self.csvwriter = csv.writer(open('items2.csv', 'wb'))

    def process_item(self, item, spider):  # item must be the second parameter here, otherwise you get the spider object
        self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
        return item
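As an aside, the pipeline above opens items2.csv in __init__ and never closes it, so rows can sit in an unflushed buffer until the process exits. A sketch of the same pipeline using Scrapy's open_spider/close_spider hooks (my own rewrite, not something from the project):

import csv

class CsvWriterPipeline(object):
    def open_spider(self, spider):
        # open the file when the spider starts and keep a handle so it can be closed
        self.file = open('items2.csv', 'wb')
        self.csvwriter = csv.writer(self.file)

    def process_item(self, item, spider):
        self.csvwriter.writerow([item['all_links'], item['current_url'], item['start_url']])
        return item

    def close_spider(self, spider):
        # flush and close when the spider finishes
        self.file.close()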
Answer 0 (score: 1)
I ran into the same problem.
Here is what worked for me:
Put the export URI into settings.py:
FEED_URI = 'file:///tmp/feeds/filename.jsonlines'
Then create a scrape.py script next to scrapy.cfg with the following content:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl('yourspidername') #'yourspidername' is the name of one of the spiders of the project.
process.start() # the script will block here until the crawling is finished
Run it with: python scrape.py
Result: the file is created.
Note: my project has no pipelines, so I cannot say whether a pipeline would filter your results.
Also, the common pitfalls section of the docs is what helped me.
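If you would rather not hard-code the feed settings in settings.py, the same cmdline-priority overrides used in the question should also work when passed to CrawlerProcess. A sketch under that assumption (spider name and paths are placeholders):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

settings = get_project_settings()
# override the feed settings in the script instead of settings.py
settings.set('FEED_URI', 'file:///tmp/feeds/filename.jsonlines', priority='cmdline')
settings.set('FEED_FORMAT', 'jsonlines', priority='cmdline')

process = CrawlerProcess(settings)
process.crawl('yourspidername')  # name of a spider registered in the project
process.start()  # blocks until the crawl finishes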