I'm trying to learn Scrapy, running both the program shown in "Scrapy at a glance" and the one from the tutorial.
Running the crawl command with -o file.json -t json produces a JSON file full of empty objects:
$ cat test.json
[{},
{},
{},
{},
{},
{},
{},
{},
...
{},
{},
{}]
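For what it's worth, the file is syntactically valid JSON; it parses without error, the objects are just empty. A quick check with the stdlib json module (the sample string here stands in for the actual file, truncated to three items):

```python
import json

# Stand-in for the start of test.json as shown above (truncated to three items).
sample = '[{}, {}, {}]'

items = json.loads(sample)  # parses fine, so the file itself is valid JSON
print(all(item == {} for item in items))  # → True: every object has no fields
```

So the exporter is writing well-formed JSON, it just isn't putting any fields into the items.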
It does not seem to be because the data is missing, since if I export to a CSV file instead, the content is saved:
body,votes,tags,link,title
"<div class=""post-text"" itemprop=""text"">
<p>Here is a piece of <strong>C++</strong> code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster.</p>
<pre class=""lang-cpp prettyprint-override""><code>#include <algorithm>
...
I should point out that this was on my university's computer, so it shouldn't be a problem with my Scrapy configuration.
Does anyone know what I'm missing to get a valid JSON file? Or is the JSON export broken, so that it has to be done by hand through a pipeline? The documentation insists that no pipeline is needed if you just want to export the output.
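If it really does have to go through a pipeline, this is roughly what I would write — a minimal sketch using only the stdlib json module (the class name, output filename, and the write-everything-on-close strategy are my own choices, not from the docs):

```python
import json

class JsonWriterPipeline:
    """Hypothetical manual replacement for the built-in JSON feed export."""

    def open_spider(self, spider):
        self.items = []

    def process_item(self, item, spider):
        # Items may be plain dicts (first example) or scrapy.Item objects
        # (second example); dict() handles both.
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        # Dump the whole list in one go so the file is always valid JSON.
        with open('items_pipeline.json', 'w') as f:
            json.dump(self.items, f, indent=2)
```

It would then be enabled through ITEM_PIPELINES in settings.py, but that seems like a lot of ceremony for something the feed export is supposed to do on its own.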
Edit:
Code for the first example:
import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
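To rule out the data itself, I also checked by hand that a dict shaped like the ones this spider yields serializes fine (the values here are placeholders I made up, not actual scraped data):

```python
import json

# Hand-built dict with the same keys the spider yields; values are placeholders.
item = {
    'title': 'Example question title',
    'votes': '123',
    'body': '<div class="post-text">...</div>',
    'tags': ['python', 'scrapy'],
    'link': 'http://stackoverflow.com/questions?sort=votes',
}

serialized = json.dumps(item)
print(json.loads(serialized) == item)  # → True: round-trips without losing fields
```

So there is nothing about the structure being yielded that the JSON serializer should choke on.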
Code for the second example:
items.py:
import scrapy

class TeststackoverflowItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    subject = scrapy.Field()
    score = scrapy.Field()
    bestAnswer = scrapy.Field()
spider.py:
import scrapy
from testStackOverflow.items import TeststackoverflowItem

class SpiderStackOverflow(scrapy.Spider):
    name = "stackoverflow"
    allowed_domains = ['stackoverflow.com']
    # though it looks like it gives the same result without ?sort=votes,
    # it's not the case with a wget
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        for div in response.css('div.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(div.extract())
            # print('full url: ', full_url)
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        item = TeststackoverflowItem()
        item['name'] = response.xpath('//title/text()').extract()  # works
        item['subject'] = response.css('.post-text').extract()
        item['score'] = response.css('.question .vote-count-post::text').extract()[0]  # works
        item['bestAnswer'] = response.css('.answercell').extract()[0]
        yield item
Final shell output for the second example:
2016-04-01 16:14:51+0200 [stackoverflow] INFO: Closing spider (finished)
2016-04-01 16:14:51+0200 [stackoverflow] INFO: Stored json feed (50 items) in: top-stackoverflow-questions.json
2016-04-01 16:14:51+0200 [stackoverflow] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 23724,
'downloader/request_count': 51,
'downloader/request_method_count/GET': 51,
'downloader/response_bytes': 2154436,
'downloader/response_count': 51,
'downloader/response_status_count/200': 51,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 1, 14, 14, 51, 769488),
'item_scraped_count': 50,
'log_count/DEBUG': 101,
'log_count/INFO': 4,
'request_depth_max': 1,
'response_received_count': 51,
'scheduler/dequeued': 51,
'scheduler/dequeued/memory': 51,
'scheduler/enqueued': 51,
'scheduler/enqueued/memory': 51,
'start_time': datetime.datetime(2016, 4, 1, 14, 14, 43, 691509)}
2016-04-01 16:14:51+0200 [stackoverflow] INFO: Spider closed (finished)
Sources: http://doc.scrapy.org/en/latest/intro/overview.html#walk-through-of-an-example-spider http://doc.scrapy.org/en/latest/intro/tutorial.html