I'm trying to scrape a website by its div elements: for each div I extract some data, then follow the child link it contains to scrape more data.
Here is the code for quote.py:
import scrapy
from ..items import QuotesItem

class QuoteSpider(scrapy.Spider):
    name = 'quote'
    baseurl = 'http://quotes.toscrape.com'
    start_urls = [baseurl]

    def parse(self, response):
        all_div_quotes = response.css('.quote')
        for quote in all_div_quotes:
            item = QuotesItem()
            title = quote.css('.text::text').extract()
            author = quote.css('.author::text').extract()
            tags = quote.css('.tag::text').extract()
            author_details_url = self.baseurl + quote.css('.author + a::attr(href)').extract_first()
            item['title'] = title
            item['author'] = author
            item['tags'] = tags
            request = scrapy.Request(author_details_url,
                                     callback=self.author_born,
                                     meta={'item': item, 'next_url': author_details_url})
            yield request

    def author_born(self, response):
        item = response.meta['item']
        next_url = response.meta['next_url']
        author_born = response.css('.author-born-date::text').extract()
        item['author_born'] = author_born
        yield scrapy.Request(next_url, callback=self.author_birthplace,
                             meta={'item': item})

    def author_birthplace(self, response):
        item = response.meta['item']
        author_birthplace = response.css('.author-born-location::text').extract()
        item['author_birthplace'] = author_birthplace
        yield item
Here is the code for items.py:
import scrapy

class QuotesItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
    author_born = scrapy.Field()
    author_birthplace = scrapy.Field()
I ran the command scrapy crawl quote -o data.json. There were no error messages, but data.json is empty. I expected to get all the data in its corresponding fields.
Can you help me?
Answer (score: 1)
Look closely at your logs and you will be able to find messages like this:
DEBUG: Filtered duplicate request: <GET http://quotes.toscrape.com/author/Albert-Einstein>
Scrapy manages duplicate requests automatically and tries not to visit the same URL twice (for obvious reasons). In this case you can add dont_filter=True to the request.
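As an illustration, this is just your own request from parse with the flag added (nothing else changes):

    request = scrapy.Request(author_details_url,
                             callback=self.author_born,
                             meta={'item': item, 'next_url': author_details_url},
                             dont_filter=True)

With that flag in place you will see log lines like these: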
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Steve-Martin/> (referer: http://quotes.toscrape.com/author/Steve-Martin/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Albert-Einstein/> (referer: http://quotes.toscrape.com/author/Albert-Einstein/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Marilyn-Monroe/> (referer: http://quotes.toscrape.com/author/Marilyn-Monroe/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/J-K-Rowling/> (referer: http://quotes.toscrape.com/author/J-K-Rowling/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Eleanor-Roosevelt/> (referer: http://quotes.toscrape.com/author/Eleanor-Roosevelt/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Andre-Gide/> (referer: http://quotes.toscrape.com/author/Andre-Gide/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Thomas-A-Edison/> (referer: http://quotes.toscrape.com/author/Thomas-A-Edison/)
2019-07-15 19:33:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Jane-Austen/> (referer: http://quotes.toscrape.com/author/Jane-Austen/)
The referers look a little odd because each author page requests itself: your author_born callback re-requests next_url, which is exactly the URL whose response it has just received, so the default duplicate filter silently drops it.
Putting it all together, you might end up with something like this:
import scrapy

class QuoteSpider(scrapy.Spider):
    name = 'quote'
    baseurl = 'http://quotes.toscrape.com'
    start_urls = [baseurl]

    def parse(self, response):
        all_div_quotes = response.css('.quote')
        for quote in all_div_quotes:
            item = dict()
            title = quote.css('.text::text').extract()
            author = quote.css('.author::text').extract()
            tags = quote.css('.tag::text').extract()
            author_details_url = self.baseurl + quote.css('.author + a::attr(href)').extract_first()
            item['title'] = title
            item['author'] = author
            item['tags'] = tags
            print(item)
            # dont_filter=True in case we get two quotes by a single author.
            # This is not optimal though. A better decision would be to save author
            # data to self.storage and only visit author pages we haven't seen yet,
            # taking the info from the saved dict otherwise (see the sketch below).
            request = scrapy.Request(author_details_url,
                                     callback=self.author_info,
                                     meta={'item': item},
                                     dont_filter=True)
            yield request

    def author_info(self, response):
        item = response.meta['item']
        author_born = response.css('.author-born-date::text').extract()
        author_birthplace = response.css('.author-born-location::text').extract()
        item['author_born'] = author_born
        item['author_birthplace'] = author_birthplace
        yield item
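For completeness, here is a minimal sketch of the caching idea mentioned in the comment above. The self.storage name comes from that comment; the spider name and the cache keying are illustrative assumptions, not the only way to do it:

    import scrapy

    class QuoteCachingSpider(scrapy.Spider):
        name = 'quote_cached'
        baseurl = 'http://quotes.toscrape.com'
        start_urls = [baseurl]

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            # Maps an author page URL to its scraped data.
            self.storage = {}

        def parse(self, response):
            for quote in response.css('.quote'):
                item = {
                    'title': quote.css('.text::text').extract(),
                    'author': quote.css('.author::text').extract(),
                    'tags': quote.css('.tag::text').extract(),
                }
                author_details_url = self.baseurl + quote.css('.author + a::attr(href)').extract_first()
                if author_details_url in self.storage:
                    # Author already fetched: fill in cached data, no extra request.
                    item.update(self.storage[author_details_url])
                    yield item
                else:
                    # First time we see this author: fetch the page once. Note that
                    # if a second quote by the same author is parsed before this
                    # request returns, the duplicate filter will drop its request;
                    # handling that in-flight case needs extra bookkeeping.
                    yield scrapy.Request(author_details_url,
                                         callback=self.author_info,
                                         meta={'item': item, 'cache_key': author_details_url})

        def author_info(self, response):
            item = response.meta['item']
            author_data = {
                'author_born': response.css('.author-born-date::text').extract(),
                'author_birthplace': response.css('.author-born-location::text').extract(),
            }
            # Cache under the originally requested URL (response.url may differ
            # after a redirect, e.g. a trailing slash being added).
            self.storage[response.meta['cache_key']] = author_data
            item.update(author_data)
            yield item

You run it the same way (scrapy crawl quote_cached -o data.json); each author page is fetched at most once.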