I have a problem with pagination on this site: http://gamesurf.tiscali.it/ps4/recensioni.html
Part of my spider code:
for pag in response.css('li.square-nav'):
    next = pag.css('li.square-nav > a > span::text').extract_first()
    if next == '»':
        next_page_url = pag.css('a::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)
If I run my spider from the Windows terminal, it crawls all pages of the site, but when I deploy it to Scrapinghub and run it from the button in the dashboard, the spider only scrapes the first page. Among the log messages there is a warning:
[py.warnings] /app/__main__.egg/reccy/spiders/reccygsall.py:21:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to
Unicode - interpreting them as being unequal.
Line 21 is this:
if next=='»':
I have checked that the problem is not caused by robots.txt. How can I solve this? Thanks.
The whole spider:
# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'reccygsall'
    allowed_domains = ['gamesurf.tiscali.it']
    start_urls = ['http://gamesurf.tiscali.it/ps4/recensioni.html']

    def parse(self, response):
        for quote in response.css("div.boxn1"):
            item = {
                'title': quote.css('div.content.fulllayer > h3 > a::text').extract_first(),
                'text': quote.css('div.content.fulllayer > h3 > a::attr(href)').extract_first(),
            }
            yield item
        for pag in response.css('li.square-nav'):
            next = pag.css('li.square-nav > a > span::text').extract_first()
            if next == '»':
                next_page_url = pag.css('a::attr(href)').extract_first()
                if next_page_url:
                    next_page_url = response.urljoin(next_page_url)
                    yield scrapy.Request(url=next_page_url, callback=self.parse)
Answer 0 (score: 0):
I found a solution:
# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'reccygsall'
    allowed_domains = ['gamesurf.tiscali.it']
    start_urls = ['http://gamesurf.tiscali.it/ps4/recensioni.html']
    contatore = 0  # number of pages parsed so far

    def parse(self, response):
        for quote in response.css("div.boxn1"):
            item = {
                'title': quote.css('div.content.fulllayer > h3 > a::text').extract_first(),
                'text': quote.css('div.content.fulllayer > h3 > a::attr(href)').extract_first(),
            }
            yield item

        self.contatore = self.contatore + 1
        a = 0  # counts pagination links that have no direct link text
        for pag in response.css('li.square-nav'):
            # The arrow links keep their glyph inside a nested <span>, so
            # a::text is None for them, while page-number links have text.
            next = pag.css('a::text').extract_first()
            if next is None:
                a = a + 1
            # On the first parsed page follow every candidate link; afterwards
            # only follow links from the second text-less (arrow) entry onward.
            if (self.contatore < 2) or (a > 1):
                next_page_url = pag.css('a::attr(href)').extract_first()
                if next_page_url:
                    next_page_url = response.urljoin(next_page_url)
                    yield scrapy.Request(url=next_page_url, callback=self.parse)
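For what it's worth, the UnicodeWarning in the question suggests the Scrapinghub stack was running the spider under Python 2: there the '»' written in the source file is a UTF-8 byte string, while extract_first() returns unicode, so the comparison silently evaluates as unequal. A minimal sketch of a more direct fix, assuming that really is the cause, keeps the original selector and compares against a unicode literal (only the pagination loop inside parse() is shown):

for pag in response.css('li.square-nav'):
    next = pag.css('li.square-nav > a > span::text').extract_first()
    if next == u'»':  # u'' prefix makes this unicode-vs-unicode under Python 2
        next_page_url = pag.css('a::attr(href)').extract_first()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(url=next_page_url, callback=self.parse)

Under Python 3 the u'' prefix is harmless, so the same comparison works in both environments.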