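I am trying to scrape Booking.com to get tags, reviews, and similar information. The spider scrapes the information I want, but when I add code to follow the next page, it does not even scrape the first page. I am familiar with link extractors, but I don't know how to supply values for their parameters, for example what to pass for allow and restrict. Please tell me what is wrong with the code and how to set the parameters. Any help would be greatly appreciated. Thanks in advance! Here is the code: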
import scrapy
import urllib
from scrapy.loader import ItemLoader
from quo.items import QuoItem


class MySpider(scrapy.Spider):
    name = 'quotes1'
    allowed_domains = ['booking.com']

    def start_requests(self):
        yield scrapy.Request('https://www.booking.com/hotel/us/jolly-madison-towers.html?label=gen173nr-1DCAEoggJCAlhYSDNiBW5vcmVmaLUBiAEBmAExwgEKd2luZG93cyAxMMgBDNgBA-gBAfgBApICAXmoAgM;sid=a4397d59763c8a4caec92e6242e50b46;checkin=2017-07-05;checkout=2017-07-06;ucfs=1;soh=1;highlighted_blocks=;all_sr_blocks=;room1=A%2CA;soldout=0%2C0;hpos=2;dest_type=city;dest_id=20088325;srfid=4cc7afd646ecfd94fcc083a01cd786310c221d4eX2;from=searchresults;highlight_room=#hotelTmpl', self.parse)

    rules = (
    )

    def parse(self, response):
        reviewsurl = response.xpath('//a[@class="show_all_reviews_btn"]/@href')
        url = response.urljoin(reviewsurl[0].extract())
        self.pageNumber = 1
        yield scrapy.Request(url, callback=self.parse_content)

    def parse_content(self, response):
        item = QuoItem()
        user_rating = response.xpath('//span[@itemprop="reviewRating"]/meta[@itemprop="ratingValue"]/@content').extract()
        if user_rating:
            item['User_Rating'] = user_rating
        title = response.xpath('//*[@class="review_item_header_content_container"]/a/span[@itemprop="name"]').extract()
        if title:
            item['Title'] = title
        rev_positive = response.xpath('//p[@class="review_pos"]/span[@itemprop="reviewBody"]/text()').extract()
        if rev_positive:
            item['rev_positive'] = rev_positive
        rev_negative = response.xpath('//p[@class="review_neg"]/span[@itemprop="reviewBody"]/text()').extract()
        if rev_negative:
            item['rev_negative'] = rev_negative
        return item
        next_page = response.xpath('//a[@id="review_next_page_link"]/@href').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse_content)
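Here is the output when I run this spider with the next_page code (Output when I add next_page code). As you can see in the log, the spider reaches the reviews page but does not scrape anything. Here is the output when I remove the next_page code (Output when I remove the next_page). In that case it reaches the reviews page and scrapes it as instructed.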
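A likely cause of the behaviour described above, assuming the spider runs under Python 3: once parse_content contains a yield anywhere in its body, Python treats the whole function as a generator, and a return item inside a generator only stops iteration without emitting the item. The sketch below reproduces that with plain Python and no Scrapy; the function names and the placeholder dict and string are hypothetical stand-ins for the callback, the item, and the next-page request.

```python
def parse_with_return():
    """Mimics the callback as posted: `return item` followed by a `yield`."""
    item = {"Title": "example"}
    return item  # inside a generator this only ends iteration; the item is not emitted
    yield "next-page request"  # never reached, but it makes the function a generator


def parse_with_yield():
    """Same callback with the item yielded, so iteration continues afterwards."""
    item = {"Title": "example"}
    yield item  # the item is emitted to whoever iterates the callback
    yield "next-page request"  # the follow-up request is emitted as well


# Scrapy iterates over whatever a callback produces; list() plays that role here.
print(list(parse_with_return()))
print(list(parse_with_yield()))
```

If this is indeed the problem, replacing return item with yield item in parse_content should let the spider emit the item and then the request for the next page. A link extractor is not strictly needed for that; the allow and restrict_xpaths arguments of Scrapy's LinkExtractor are mainly useful together with CrawlSpider rules.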