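I am trying to run a Scrapy project that needs multiple spiders and feeds their results into a single JSON file. I am using CrawlerRunner from the Scrapy documentation, in a file named base.py: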
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from my_project.spiders.spider1 import Spider1Spider
from my_project.spiders.spider2 import Spider2Spider

settings = get_project_settings()
runner = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(Spider1Spider)
    yield runner.crawl(Spider2Spider)
    reactor.stop()

crawl()
reactor.run()
In settings.py, I have
FEED_FORMAT = 'json'
FEED_URI = 'filename.json'
But when I run base.py, the JSON file contains the brackets ][ between the results of Spider1 and Spider2, which makes it an invalid JSON file.
Using the Scrapy example, spider1.py
import scrapy

class Spider1Spider(scrapy.Spider):
    name = 'spider1'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first()
            }
and spider2.py
import scrapy

class Spider2Spider(scrapy.Spider):
    name = 'spider2'
    start_urls = ['http://quotes.toscrape.com/page/2/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'author': quote.css('small.author::text').extract_first()
            }
give this result:
[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"}
][
{"author": "Marilyn Monroe"},
{"author": "J.K. Rowling"}
]
The expected result would be
[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"},
{"author": "Marilyn Monroe"},
{"author": "J.K. Rowling"}
]
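As a stopgap, the two arrays could presumably be merged after the crawl has finished. Below is a minimal sketch of that idea, assuming the feed file contains nothing but back-to-back JSON arrays as in the output above. The helper name merge_json_arrays is hypothetical and not part of Scrapy; it only uses the standard library's json.JSONDecoder.raw_decode.

```python
import json


def merge_json_arrays(text):
    """Parse back-to-back JSON arrays such as '[...][...]' into one list."""
    decoder = json.JSONDecoder()
    merged = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # Decode one complete JSON value and get the index where it ended.
        chunk, pos = decoder.raw_decode(text, pos)
        if not isinstance(chunk, list):
            raise ValueError("expected a JSON array, got %r" % type(chunk))
        merged.extend(chunk)
    return merged
```

The merged list could then be written back with json.dump to get one valid array. This repairs the file after the fact; it does not stop the ][ from being written in the first place.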
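What am I missing? Do I need an item pipeline to get the results of multiple spiders into a single valid JSON file?
Edit
Here is the pipeline I have been experimenting with, but it currently produces only half of the results: only the author results.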
import json

class TestPipeline(object):

    def open_spider(self, spider):
        self.file = open('file_name.json', 'wb')
        self.file.write("[")

    def close_spider(self, spider):
        self.file.write("]")
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(
            dict(item),
            sort_keys=True,
            indent=4,
            separators=(',', ': ')
        ) + ",\n"
        self.file.write(line)
        return item
What am I doing wrong?
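One likely cause, as far as I can tell: open_spider runs once per spider and reopens file_name.json in write mode, which truncates it, so the second spider wipes out what the first one wrote. The trailing ",\n" after the last item would also leave the array invalid. A rough sketch of one possible workaround follows, assuming both spiders run in the same process (as they do with CrawlerRunner above) and that all items fit in memory. The class name SharedJsonPipeline and its attributes are hypothetical.

```python
import json


class SharedJsonPipeline(object):
    """Hypothetical pipeline that gathers items from every spider in one list."""

    # Class-level attributes are shared by all pipeline instances in the
    # process, so items from spider1 are still there when spider2 finishes.
    items = []
    path = 'file_name.json'

    def process_item(self, item, spider):
        type(self).items.append(dict(item))
        return item

    def close_spider(self, spider):
        # Rewrite the whole file each time a spider closes. Text mode is
        # used so that writing str also works on Python 3.
        with open(self.path, 'w') as f:
            json.dump(type(self).items, f, sort_keys=True, indent=4)
```

Because json.dump serializes the full list in one go, there is no bracket or trailing-comma bookkeeping to get wrong. The trade-off is that nothing is streamed to disk, so this would not suit very large crawls.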