JSON file generated with Scrapy is empty

Date: 2016-04-01 13:38:41

Tags: python json scrapy

I am trying to learn Scrapy and ran the spider shown in the "Scrapy at a glance" page as well as the one from the tutorial (links at the bottom). Running the command with -o file.json -t json produces a JSON file full of empty objects:

$ cat test.json  
[{},
{},
{},
{},
{},
{},
{},
{},
...
{},
{},
{}]
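
For reference, assuming the first spider is saved as stackoverflow_spider.py (the file name is my own), the command is along the lines of:

$ scrapy runspider stackoverflow_spider.py -o test.json -t json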

It does not seem to be because the data is missing: if I export to a CSV file instead, the content is saved:

body,votes,tags,link,title  
"<div class=""post-text"" itemprop=""text"">

<p>Here is a piece of <strong>C++</strong> code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster.</p>

<pre class=""lang-cpp prettyprint-override""><code>#include &lt;algorithm&gt;

...
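
The CSV above comes from the same command with the format switched, e.g. -o test.csv -t csv (file names again arbitrary).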

I should point out that this is on my university's computers, so it should not be a problem with my local Scrapy configuration.

Does anyone know what I am missing to get a valid JSON file? Or is the JSON export broken, so that it has to be done manually through an item pipeline? The documentation insists that no pipeline is needed just to export the output.
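
For reference, my understanding of what a manual export would look like is a pipeline along the lines of the JsonWriterPipeline example in the documentation; this is only a sketch, and the output file name is arbitrary:

import json

class JsonWriterPipeline(object):
    """Sketch of a manual JSON export, adapted from the docs' JsonWriterPipeline example."""

    def open_spider(self, spider):
        # open the output file when the spider starts (file name is arbitrary)
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # serialize each item to one JSON line and pass the item on unchanged
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

It would then be enabled in settings.py with something like ITEM_PIPELINES = {'testStackOverflow.pipelines.JsonWriterPipeline': 300} (the module path depends on the project layout). But again, the documentation says this should not be necessary just to export items.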

Edit:
Code for the first example:

import scrapy


class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
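        # follow every question link found on the listing page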
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
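        # scrape the fields of an individual question page into a plain dict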
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url
        }

Code for the second example:
items.py:

import scrapy


class TeststackoverflowItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    subject = scrapy.Field()
    score = scrapy.Field()
    bestAnswer = scrapy.Field()

spider.py:

import scrapy
from testStackOverflow.items import TeststackoverflowItem

class SpiderStackOverflow(scrapy.Spider):
    name = "stackoverflow"
    allowed_domains = ['stackoverflow.com']
    #though it looks like it gives the same result without ?sort=votes, it's not the case with a wget
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        for div in response.css('div.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(div.extract())
            #print ('full url: ', full_url)
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        item = TeststackoverflowItem()
        item['name'] = response.xpath('//title/text()').extract() #works
        item['subject'] = response.css('.post-text').extract()
        item['score'] = response.css('.question .vote-count-post::text').extract()[0] #works
        item['bestAnswer'] = response.css('.answercell').extract()[0]
        yield item
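
This one is run from inside the project, along the lines of:

$ scrapy crawl stackoverflow -o top-stackoverflow-questions.json -t json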

Final shell output of the second example:

2016-04-01 16:14:51+0200 [stackoverflow] INFO: Closing spider (finished)
2016-04-01 16:14:51+0200 [stackoverflow] INFO: Stored json feed (50 items) in: top-stackoverflow-questions.json
2016-04-01 16:14:51+0200 [stackoverflow] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 23724,
     'downloader/request_count': 51,
     'downloader/request_method_count/GET': 51,
     'downloader/response_bytes': 2154436,
     'downloader/response_count': 51,
     'downloader/response_status_count/200': 51,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2016, 4, 1, 14, 14, 51, 769488),
     'item_scraped_count': 50,
     'log_count/DEBUG': 101,
     'log_count/INFO': 4,
     'request_depth_max': 1,
     'response_received_count': 51,
     'scheduler/dequeued': 51,
     'scheduler/dequeued/memory': 51,
     'scheduler/enqueued': 51,
     'scheduler/enqueued/memory': 51,
     'start_time': datetime.datetime(2016, 4, 1, 14, 14, 43, 691509)}
2016-04-01 16:14:51+0200 [stackoverflow] INFO: Spider closed (finished)

Sources:
http://doc.scrapy.org/en/latest/intro/overview.html#walk-through-of-an-example-spider
http://doc.scrapy.org/en/latest/intro/tutorial.html

0 answers:

No answers