I'm running a single process with two spiders to crawl pages, but the JSON output file I get is empty.
Here is the code:
Spiders.py
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from datetime import datetime, time
import uuid
import json
from BourseCrawler.items import IndiceItem, FirmsItem
class FirmsSetSpider(scrapy.Spider):
    # first spider (body omitted)
    ...

class IndiceSetSpider(scrapy.Spider):
    # second spider (body omitted)
    ...
# Run both spiders in a single process; the output is written by the JSON pipeline
settings = get_project_settings()
process = CrawlerProcess(settings)
process.crawl(FirmsSetSpider)
process.crawl(IndiceSetSpider)
process.start() # the script will block here until all crawling jobs are finished # noqa
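One detail worth noting about the setup above (a minimal sketch with made-up file contents, not taken from the original code): when two spiders run in the same CrawlerProcess, each crawler gets its own pipeline instance, and each instance opens scrape_1.json in 'w' mode, which truncates the file on every open:

```python
import os
import tempfile

# Hypothetical stand-in path for scrape_1.json from the pipeline below.
path = os.path.join(tempfile.mkdtemp(), "scrape_1.json")

# First pipeline instance opens, writes, and closes the file.
f1 = open(path, "w")
f1.write('{"firm": "example"}\n')  # hypothetical item line
f1.close()
print(os.path.getsize(path))  # → 20 (the line survives so far)

# Second pipeline instance opens the same path in 'w' mode,
# which truncates everything the first instance wrote.
f2 = open(path, "w")
f2.close()
print(os.path.getsize(path))  # → 0
```

So even when the spiders do yield items, the two pipeline instances can clobber each other's output in the shared file.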
I also edited my settings file, settings.py:
ITEM_PIPELINES = {
    'BourseCrawler.pipelines.JsonWriterPipeline': 300,
}
and added the JSON writer pipeline:
pipelines.py
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('scrape_1.json', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
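The pipeline logic itself can be exercised outside Scrapy with plain dicts standing in for the items, which helps separate "pipeline is broken" from "spiders never yield items". This is a sketch: the configurable path, the dict items, and the manual method calls are all made up for illustration.

```python
import json
import os
import tempfile

class JsonWriterPipeline(object):
    """Same logic as the pipeline above, with the output path made
    configurable so the sketch can write to a temporary file."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()

# Drive the pipeline by hand, the way Scrapy would during a crawl.
path = os.path.join(tempfile.mkdtemp(), "scrape_1.json")
pipeline = JsonWriterPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"firm": "Acme"}, spider=None)   # hypothetical item
pipeline.process_item({"indice": "XYZ"}, spider=None)  # hypothetical item
pipeline.close_spider(spider=None)

with open(path) as f:
    print(f.read().splitlines())  # → ['{"firm": "Acme"}', '{"indice": "XYZ"}']
```

If a test like this writes the lines correctly, the empty file points at the spiders (no items yielded, or a missing `name` attribute) or at the two pipeline instances sharing one output file, rather than at the pipeline code.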