I am currently running Scrapy with the following command-line arguments:
scrapy crawl my_spider -o data.json
However, I would prefer to "save" this command in a Python script. Following https://doc.scrapy.org/en/latest/topics/practices.html, I have the following script:
import scrapy
from scrapy.crawler import CrawlerProcess
from apkmirror_scraper.spiders.sitemap_spider import ApkmirrorSitemapSpider
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(ApkmirrorSitemapSpider)
process.start() # the script will block here until the crawling is finished
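However, it is not clear to me from the documentation what the equivalent of the -o data.json command-line argument is inside the script. How can I make the script produce a JSON file?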
Answer 0 (score: 10)
You need to add FEED_FORMAT and FEED_URI to the settings dictionary passed to CrawlerProcess:
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'FEED_FORMAT': 'json',
    'FEED_URI': 'data.json'
})
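
Putting the pieces together, the whole script might look like the sketch below. It is only an illustration under a few assumptions: build_settings and run are hypothetical helper names, not part of Scrapy, and the spider import is the one from the question. Note also that on Scrapy 2.1 and later the FEED_FORMAT and FEED_URI settings are deprecated in favour of a single FEEDS setting, so the helper can emit either form.

```python
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'


def build_settings(output_path, feed_format='json', use_feeds=True):
    """Return a settings dict that mirrors `-o <output_path>`.

    Hypothetical helper (not a Scrapy API).
    use_feeds=True  -> FEEDS setting (Scrapy 2.1 and later)
    use_feeds=False -> FEED_FORMAT / FEED_URI (older releases)
    """
    settings = {'USER_AGENT': USER_AGENT}
    if use_feeds:
        settings['FEEDS'] = {output_path: {'format': feed_format}}
    else:
        settings['FEED_FORMAT'] = feed_format
        settings['FEED_URI'] = output_path
    return settings


def run(spider_cls, output_path='data.json', use_feeds=True):
    """Crawl with spider_cls and export the scraped items to output_path."""
    # Imported here so the settings helper can be used without Scrapy installed.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(build_settings(output_path, use_feeds=use_feeds))
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl is finished


if __name__ == '__main__':
    # Spider import taken from the question; adjust to your own project.
    from apkmirror_scraper.spiders.sitemap_spider import ApkmirrorSitemapSpider

    run(ApkmirrorSitemapSpider, 'data.json')
```

If you are on an older Scrapy release, pass use_feeds=False to get the same FEED_FORMAT and FEED_URI keys shown in the answer.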