I am trying to learn Scrapy.
# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com/']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.xpath('//*[@class="quote"]')
        for quote in quotes:
            text = quote.xpath(".//*[@class='text']/text()").extract_first()
            author = quote.xpath("//*[@itemprop='author']/text()").extract_first()
            tags = quote.xpath(".//*[@class='tag']/text()").extract();
            item = {
                'author_name': author,
                'text': text,
                'tags': tags
            }
            yield item
        next_page_url = response.xpath("//*[@class='next']/a/@href").extract_first()
        absolute_next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(url=absolute_next_page_url, callback=self.parse)
But Scrapy only parses the first page. What is wrong with this code? I copied it from a YouTube tutorial.
Please help.
Answer 0 (score: 3)
All requests after the first one are being filtered out as "offsite". The offsite middleware compares each request's hostname against the entries in allowed_domains, and because of the extra slash at the end of your value, 'quotes.toscrape.com/' never matches the hostname quotes.toscrape.com, so every follow-up request is dropped:

allowed_domains = ['quotes.toscrape.com/']
# REMOVE THIS SLASH --------------------^
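
For reference, this is the corrected attribute: the domain name only, with no scheme, path, or trailing slash. With this value the offsite middleware accepts requests to quotes.toscrape.com and its subdomains, so the pagination requests go through:

allowed_domains = ['quotes.toscrape.com']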
Answer 1 (score: 0)
Remove or comment out allowed_domains. Optionally, remove the stray semicolon at the end of the tags line. Also, indent the following code so that it sits inside the parse method:
next_page_url = response.xpath("//*[@class='next']/a/@href").extract_first()
absolute_next_page_url = response.urljoin(next_page_url)
yield scrapy.Request(url=absolute_next_page_url,callback=self.parse)
So the spider becomes the following:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    # allowed_domains = ['quotes.toscrape.com/']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.xpath('//*[@class="quote"]')
        for quote in quotes:
            text = quote.xpath(".//*[@class='text']/text()").extract_first()
            # use a relative path (.//) so each quote's own author is matched,
            # not the first author on the page
            author = quote.xpath(".//*[@itemprop='author']/text()").extract_first()
            tags = quote.xpath(".//*[@class='tag']/text()").extract()
            item = {
                'author_name': author,
                'text': text,
                'tags': tags
            }
            yield item
        # follow the "Next" link; the last page has none, so guard against None
        next_page_url = response.xpath("//*[@class='next']/a/@href").extract_first()
        if next_page_url:
            absolute_next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=absolute_next_page_url, callback=self.parse)
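
To try it out, you can run the spider as a standalone script and export the scraped items to JSON. The file name quotes_spider.py is just an illustration; use whatever name you saved the spider under:

scrapy runspider quotes_spider.py -o quotes.json

The crawl should now follow the "Next" link from page to page instead of stopping after the first one.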