I'm running a single process with two spiders to crawl pages, but the JSON output file I get is empty.
Here is the code:
Spiders.py
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from datetime import datetime, time
import uuid
import json
from BourseCrawler.items import IndiceItem, FirmsItem
class FirmsSetSpider(scrapy.Spider):
    # first spider (body omitted)
    ...

class IndiceSetSpider(scrapy.Spider):
    # second spider (body omitted)
    ...
# Run both spiders in a single process; the output is written by the JSON pipeline
settings = get_project_settings()
process = CrawlerProcess(settings)
process.crawl(FirmsSetSpider)
process.crawl(IndiceSetSpider)
process.start() # the script will block here until all crawling jobs are finished # noqa
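One detail worth noting about the setup above (a minimal sketch with made-up file contents, not taken from the original code): when two spiders run in the same CrawlerProcess, each crawler gets its own pipeline instance, and each instance opens scrape_1.json in 'w' mode, which truncates the file on every open:

```python
import os
import tempfile

# Hypothetical stand-in path for scrape_1.json from the pipeline below.
path = os.path.join(tempfile.mkdtemp(), "scrape_1.json")

# First pipeline instance opens, writes, and closes the file.
f1 = open(path, "w")
f1.write('{"firm": "example"}\n')  # hypothetical item line
f1.close()
print(os.path.getsize(path))  # → 20 (the line survives so far)

# Second pipeline instance opens the same path in 'w' mode,
# which truncates everything the first instance wrote.
f2 = open(path, "w")
f2.close()
print(os.path.getsize(path))  # → 0
```

So even when the spiders do yield items, the two pipeline instances can clobber each other's output in the shared file.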
I also edited my settings file, settings.py:
ITEM_PIPELINES = {
    'BourseCrawler.pipelines.JsonWriterPipeline': 300,
}
and added the JSON writer pipeline:
pipelines.py
import json

class JsonWriterPipeline(object):

    def open_spider(self, spider):
        self.file = open('scrape_1.json', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()
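The pipeline logic itself can be exercised outside Scrapy with plain dicts standing in for the items, which helps separate "pipeline is broken" from "spiders never yield items". This is a sketch: the configurable path, the dict items, and the manual method calls are all made up for illustration.

```python
import json
import os
import tempfile

class JsonWriterPipeline(object):
    """Same logic as the pipeline above, with the output path made
    configurable so the sketch can write to a temporary file."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        self.file = open(self.path, 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        self.file.close()

# Drive the pipeline by hand, the way Scrapy would during a crawl.
path = os.path.join(tempfile.mkdtemp(), "scrape_1.json")
pipeline = JsonWriterPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"firm": "Acme"}, spider=None)   # hypothetical item
pipeline.process_item({"indice": "XYZ"}, spider=None)  # hypothetical item
pipeline.close_spider(spider=None)

with open(path) as f:
    print(f.read().splitlines())  # → ['{"firm": "Acme"}', '{"indice": "XYZ"}']
```

If a test like this writes the lines correctly, the empty file points at the spiders (no items yielded, or a missing `name` attribute) or at the two pipeline instances sharing one output file, rather than at the pipeline code.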