This Scrapy code (taken from a blog post) only scrapes data from the first page. I added a Rule to extract data from the second page, but it still only pulls data from the first page.
Any suggestions?
Here is the code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import TfawItem


class MasseffectSpider(CrawlSpider):
    name = "massEffect"
    allowed_domains = ["tfaw.com"]
    start_urls = [
        'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time',
    ]
    rules = (
        Rule(LinkExtractor(allow=(),
                           restrict_xpaths=('//div[@class="small-corners-light"][1]/table/tbody/tr[1]/td[2]/a[@class="regularlink"]',)),
             callback='parse', follow=True),
    )

    def parse(self, response):
        for href in response.xpath('//a[@class="regularlinksmallbold product-profile-link"]/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_detail_page)

    def parse_detail_page(self, response):
        comic = TfawItem()
        comic['title'] = response.xpath('//td/div[1]/b/span[@class="blackheader"]/text()').extract()
        comic['price'] = response.xpath('//span[@class="redheader"]/text()').extract()
        comic['upc'] = response.xpath('//td[@class="xh-highlight"]/text()').extract()
        comic['url'] = response.url
        yield comic
Answer 0 (score 0):
Your spider has a few issues here. First, you are overriding the parse() method, which CrawlSpider reserves for its own use. Per the documentation:

    When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
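If you want to stay with CrawlSpider, the usual fix is to give your callback a different name and use the parse_start_url hook so the first page gets processed too. A minimal sketch; the class name, callback name, and the "next page" XPath are my own assumptions, not from the original post:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MassEffectCrawlSpider(CrawlSpider):
    name = 'massEffect_crawl'
    allowed_domains = ['tfaw.com']
    start_urls = [
        'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time',
    ]

    rules = (
        # Assumption: this XPath must match the site's real pagination
        # links, or the rule will never schedule a request.
        Rule(LinkExtractor(restrict_xpaths="//a[contains(text(), 'next page')]"),
             callback='parse_series_page', follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider hook so the first page is scraped too; the rules
        # above only fire on pages reached by following extracted links.
        return self.parse_series_page(response)

    def parse_series_page(self, response):
        # Not named 'parse', so CrawlSpider's internal parse() stays intact.
        for href in response.xpath('//a[@class="regularlinksmallbold product-profile-link"]/@href'):
            yield scrapy.Request(response.urljoin(href.extract()),
                                 callback=self.parse_detail_page)

    def parse_detail_page(self, response):
        # Placeholder detail parser for this sketch.
        yield {'url': response.url}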
The second issue is that your LinkExtractor extracts nothing: the XPath you pass to restrict_xpaths matches nothing on that page.
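You can verify this yourself before rewriting anything by testing the extractor in scrapy shell (the XPath below is the one from the question):

$ scrapy shell 'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time'
>>> from scrapy.linkextractors import LinkExtractor
>>> le = LinkExtractor(restrict_xpaths='//div[@class="small-corners-light"][1]/table/tbody/tr[1]/td[2]/a[@class="regularlink"]')
>>> le.extract_links(response)
[]

An empty list means the rule never yields a request. A common culprit in cases like this: browsers insert <tbody> into the DOM, but the raw HTML Scrapy downloads often has none, so XPaths copied from devtools that include tbody match nothing.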
My suggestion is to skip CrawlSpider here and just use the base scrapy.Spider:
import scrapy


class MySpider(scrapy.Spider):
    name = 'massEffect'
    start_urls = [
        'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time',
    ]

    def parse(self, response):
        # parse all items on the current listing page
        for href in response.xpath('//a[@class="regularlinksmallbold product-profile-link"]/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_detail_page)
        # follow the "next page" link, if there is one
        next_page = response.xpath("//a[contains(text(),'next page')]/@href").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_detail_page(self, response):
        comic = dict()
        comic['title'] = response.xpath('//td/div[1]/b/span[@class="blackheader"]/text()').extract()
        comic['price'] = response.xpath('//span[@class="redheader"]/text()').extract()
        comic['upc'] = response.xpath('//td[@class="xh-highlight"]/text()').extract()
        comic['url'] = response.url
        yield comic
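For completeness, a sketch of running this spider standalone, outside a full Scrapy project (the filename and output path are hypothetical):

$ scrapy runspider massEffect_spider.py -o comics.json

Because this uses the base scrapy.Spider, parse is the intended default callback, and the spider keeps scheduling itself on each "next page" link until extract_first() returns None.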