Question

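What I want to do

I want to create a JSON file with a Scrapy spider in Python. I am currently working through "Data Visualization with Python and JavaScript". When I run the crawl, the JSON file ends up with no data, and I don't know why.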

Directory structure

/root
nobel_winners   scrapy.cfg

/nobel_winners:
__init__.py     items.py    pipelines.py    spiders
__pycache__     middlewares.py    settings.py

/nobel_winners/spiders:
__init__.py     __pycache__     nwinners_list_spider.py

Steps / code

I put the following code in nwinners_list_spider.py under /nobel_winners/spiders.

#encoding:utf-8

import scrapy

class NWinnerItem(scrapy.Item):
    country = scrapy.Field()

class NWinnerSpider(scrapy.Spider):
    name = 'nwinners_list'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ["https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"]

    def parse(self, response):

        h2s = response.xpath('//h2')

        for h2 in h2s:
            country = h2.xpath('span[@class="mw-headline"]/text()').extract()

Then I run the following command in the root directory.

scrapy crawl nwinners_list -o nobel_winners.json

Error

The following is displayed, and no data is written to the JSON file.

2018-07-25 10:01:53 [scrapy.core.engine] INFO: Spider opened
2018-07-25 10:01:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

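What I tried

1. The source in the book is longer, but I cut it down to check only the country variable.

2. I opened the Scrapy shell (which is based on IPython) and checked each step there. I confirmed that the value is correctly stored in country.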

h2s = response.xpath('//h2')

for h2 in h2s:
    country = h2.xpath('span[@class="mw-headline"]/text()').extract()
    print(country)

Answer 1

Try this code:

import scrapy

class NWinnerItem(scrapy.Item):
    country = scrapy.Field()

class NWinnerSpider(scrapy.Spider):
    name = 'nwinners_list'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ["https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"]

    def parse(self, response):

        h2s = response.xpath('//h2')

        for h2 in h2s:
            yield NWinnerItem(
                country = h2.xpath('span[@class="mw-headline"]/text()').extract_first()
            )

Then run scrapy crawl nwinners_list -o nobel_winners.json -t json

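In the callback function, you parse the response (the web page) and return dicts with the extracted data, Item objects, Request objects, or an iterable of these objects. See the Scrapy documentation.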

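That is why 0 items were scraped: your parse method computes country but never returns it. You need to return (or yield) the items!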
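To make the point concrete without running a crawl, here is a small pure-Python sketch. It is a simplified, hypothetical model, not Scrapy's actual engine code: collect_items, parse_without_yield and parse_with_yield are names made up for illustration, and plain strings stand in for the h2 selectors.

```python
# Simplified, hypothetical model of how a crawler consumes a callback's output.
# This is NOT Scrapy's real engine; all function names here are illustrative.

def parse_without_yield(headings):
    # Mirrors the spider in the question: the value is computed
    # but never handed back, so the function returns None.
    for heading in headings:
        country = heading.strip()


def parse_with_yield(headings):
    # Mirrors the fixed spider: each value is yielded to the caller.
    for heading in headings:
        yield {"country": heading.strip()}


def collect_items(callback, headings):
    # A callback that returns None contributes no items at all.
    result = callback(headings)
    return list(result) if result is not None else []


headings = [" Argentina ", " Australia "]
print(collect_items(parse_without_yield, headings))  # []
print(collect_items(parse_with_yield, headings))
# [{'country': 'Argentina'}, {'country': 'Australia'}]
```

The first call gives an empty list for the same reason the log reports "scraped 0 items": nothing ever leaves the callback.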

Also note that .extract() returns a list of all matches for the XPath query, while .extract_first() returns only the first element of that list.
