Scrapy - scrapes data from the first page only, not from the "next page" links in pagination

Date: 2016-08-10 09:20:37

Tags: python-2.7 scrapy web-crawler scrapy-spider

This Scrapy code (taken from a blog post) only scrapes data from the first page. I added a Rule to extract data from the second page as well, but it still fetches data only from the first page.

Any suggestions?

Here is the code:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import TfawItem


class MasseffectSpider(CrawlSpider):
    name = "massEffect"
    allowed_domains = ["tfaw.com"]
    start_urls = [
        'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time',
    ]

    rules = (
        Rule(LinkExtractor(allow=(),
                           restrict_xpaths=('//div[@class="small-corners-light"][1]/table/tbody/tr[1]/td[2]/a[@class="regularlink"]',)),
             callback='parse', follow=True),
    )

    def parse(self, response):
        for href in response.xpath('//a[@class="regularlinksmallbold product-profile-link"]/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_detail_page)
        pass

    def parse_detail_page(self, response):
        comic = TfawItem()
        comic['title'] = response.xpath('//td/div[1]/b/span[@class="blackheader"]/text()').extract()
        comic['price'] = response.xpath('//span[@class="redheader"]/text()').extract()
        comic['upc'] = response.xpath('//td[@class="xh-highlight"]/text()').extract()
        comic['url'] = response.url
        yield comic

1 Answer:

Answer 0 (score: 0)

There are a few issues with your spider here. First, per the documentation:

You are overriding the parse() method, which CrawlSpider reserves for itself:

    When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.
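
If you would rather keep the CrawlSpider, the first fix is simply renaming the callback. A minimal sketch of that approach (parse_page is an arbitrary name of my own, and the link-extractor XPath is illustrative only; it still has to match the site's real pagination links):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MasseffectSpider(CrawlSpider):
    name = "massEffect"
    allowed_domains = ["tfaw.com"]
    start_urls = [
        'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time',
    ]

    rules = (
        # Any callback name except 'parse' works; CrawlSpider reserves
        # 'parse' for its own crawling logic.
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="regularlink"]',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # Same item-link extraction as in the original spider.
        for href in response.xpath('//a[@class="regularlinksmallbold product-profile-link"]/@href'):
            yield scrapy.Request(response.urljoin(href.extract()),
                                 callback=self.parse_detail_page)

    def parse_detail_page(self, response):
        # Yield a plain dict so the sketch runs without the TfawItem class.
        yield {
            'title': response.xpath('//td/div[1]/b/span[@class="blackheader"]/text()').extract(),
            'price': response.xpath('//span[@class="redheader"]/text()').extract(),
            'url': response.url,
        }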

The second problem is that your LinkExtractor extracts nothing: the XPath you passed to restrict_xpaths matches nothing in the page source. (One likely culprit is the tbody step: browsers insert tbody when rendering tables, so XPaths copied from developer tools often include it even though it is absent from the raw HTML that Scrapy downloads.)
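
You can check this quickly in scrapy shell; a sketch of such a session (the empty list is what the spider sees, so the rule never finds a link to follow):

scrapy shell 'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time'
>>> # the exact XPath from the rule's restrict_xpaths
>>> response.xpath('//div[@class="small-corners-light"][1]/table/tbody/tr[1]/td[2]/a[@class="regularlink"]')
[]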

Instead of CrawlSpider, I'd suggest just using the base scrapy.Spider:

import scrapy


class MySpider(scrapy.Spider):
    name = 'massEffect'
    start_urls = [
        'http://www.tfaw.com/Companies/Dark-Horse/Series/?series_name=Adventure-Time',
    ]

    def parse(self, response):
        # parse all items
        for href in response.xpath('//a[@class="regularlinksmallbold product-profile-link"]/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_detail_page)
        # do next page
        next_page = response.xpath("//a[contains(text(),'next page')]/@href").extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

    def parse_detail_page(self, response):
        comic = dict()
        comic['title'] = response.xpath('//td/div[1]/b/span[@class="blackheader"]/text()').extract()
        comic['price'] = response.xpath('//span[@class="redheader"]/text()').extract()
        comic['upc'] = response.xpath('//td[@class="xh-highlight"]/text()').extract()
        comic['url'] = response.url
        yield comic
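
Because this version yields plain dicts instead of TfawItem, you can run it as a standalone file outside a Scrapy project. Assuming you save it as mass_effect.py (a filename of my own choosing):

scrapy runspider mass_effect.py -o comics.json

The -o flag writes every yielded item to comics.json via Scrapy's feed exports, which makes it easy to confirm that items from every paginated page are now being collected.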