Scrapy Python pagination

Date: 2017-05-30 12:37:52

Tags: python pagination scrapy

I'm having a problem with pagination on this site: http://gamesurf.tiscali.it/ps4/recensioni.html

Part of my spider's code:

for pag in response.css('li.square-nav'):
    next = pag.css('li.square-nav > a > span::text').extract_first()
    if next=='»':
        next_page_url = pag.css('a::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

If I run my spider from the Windows terminal it crawls all the pages of the site, but when I deploy it to Scrapinghub and run it from the dashboard button, the spider only crawls the first page. Among the log messages there is this warning:

[py.warnings] /app/__main__.egg/reccy/spiders/reccygsall.py:21: 
UnicodeWarning: Unicode equal comparison failed to convert both arguments to 
Unicode - interpreting them as being unequal.

Line 21 is this:

if next=='»':

I've checked that the problem isn't caused by robots.txt. How can I fix this? Thanks
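The warning points at the likely cause: under Python 2 (which Scrapinghub stacks commonly ran at the time), `extract_first()` returns a unicode string while a bare `'»'` literal in the source file is a UTF-8 byte string, so the equality check can never match. A minimal sketch of the mismatch (illustrative values; on Python 3 the same comparison is simply `False` instead of raising `UnicodeWarning`):

```python
# -*- coding: utf-8 -*-
# Sketch of the comparison behind the UnicodeWarning (illustrative values):
# extract_first() hands back text (unicode), while a bare '»' literal in a
# Python 2 source file is a UTF-8 byte string.
raw_literal = b'\xc2\xbb'                # UTF-8 bytes for '»'
extracted = raw_literal.decode('utf-8')  # unicode text, as Scrapy returns it

# Bytes never compare equal to text: UnicodeWarning + False on Python 2,
# plain False on Python 3.
print(raw_literal == extracted)          # False

# Making both sides unicode (u'»', or the escape u'\u00bb') restores the match.
print(extracted == u'\u00bb')            # True
```

So a comparison written as `if next == u'»':` would behave the same locally and on Scrapinghub.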

The whole spider:

# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'reccygsall'
    allowed_domains = ['gamesurf.tiscali.it']
    start_urls = ['http://gamesurf.tiscali.it/ps4/recensioni.html']

    def parse(self, response):
        for quote in response.css("div.boxn1"):
            item = {
                'title':  quote.css('div.content.fulllayer > h3 > a::text').extract_first(),
                'text':  quote.css('div.content.fulllayer > h3 > a::attr(href)').extract_first(),
            }
            yield item

        for pag in response.css('li.square-nav'):
            next = pag.css('li.square-nav > a > span::text').extract_first()
            if next=='»':
                next_page_url = pag.css('a::attr(href)').extract_first()
                if next_page_url:
                    next_page_url = response.urljoin(next_page_url)
                    yield scrapy.Request(url=next_page_url, callback=self.parse)

1 Answer:

Answer 0: (score: 0)

I found a solution:

# -*- coding: utf-8 -*-

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'reccygsall'
    allowed_domains = ['gamesurf.tiscali.it']
    start_urls = ['http://gamesurf.tiscali.it/ps4/recensioni.html']

    contatore = 0

    def parse(self, response):
        for quote in response.css("div.boxn1"):
            item = {
                'title':  quote.css('div.content.fulllayer > h3 > a::text').extract_first(),
                'text':  quote.css('div.content.fulllayer > h3 > a::attr(href)').extract_first(),
            }
            yield item


        self.contatore = self.contatore + 1  # pages visited so far
        a = 0  # arrow-style anchors (no direct text) seen on this page
        for pag in response.css('li.square-nav'):
            next = pag.css('a::text').extract_first()
            if next is None:
                a = a + 1
                # the first page has only the » arrow; later pages list « first,
                # so the second arrow (a > 1) is the "next" link
                if (self.contatore < 2) or (a > 1):
                    next_page_url = pag.css('a::attr(href)').extract_first()

                    if next_page_url:
                        next_page_url = response.urljoin(next_page_url)
                        yield scrapy.Request(url=next_page_url, callback=self.parse)
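Stripped of the selectors, the guard above can be sketched with plain lists (the helper name and data shapes here are illustrative, not part of the original spider): arrow anchors have their text inside a `<span>`, so `a::text` yields `None` for them; on the first page only the » arrow exists, while later pages list the « arrow first, so the second arrow-like anchor is the one to follow.

```python
# Illustrative sketch of the answer's pagination guard, without Scrapy.
def pick_next(anchors, page_count):
    """anchors: list of (text, href) pairs as they appear in li.square-nav;
    arrow anchors have text None. page_count mirrors self.contatore."""
    a = 0
    for text, href in anchors:
        if text is None:          # an arrow-style anchor
            a += 1
            # first page: follow the only arrow; later pages: skip the
            # "previous" arrow and follow the second one
            if page_count < 2 or a > 1:
                return href
    return None

# First page: only the » arrow is present.
print(pick_next([(None, '/p2.html')], page_count=1))                       # /p2.html
# Later page: « arrow first, » arrow second.
print(pick_next([(None, '/p1.html'), (None, '/p3.html')], page_count=2))   # /p3.html
```

This makes the workaround's assumption explicit: it sidesteps the byte/unicode comparison entirely by keying on anchors whose direct text is missing, rather than matching the » character.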