Scrapy is not following the next-page link (Python)

Time: 2017-06-20 11:30:37

Tags: python scrapy scrapy-spider

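I am trying to scrape Booking.com for tags, reviews, and similar information. The spider scrapes the information I want, but as soon as I add code to follow the next page, it no longer scrapes even the first page. I am familiar with link extractors, but I don't know how to supply values for their parameters. For example, I don't know what to pass for parameters such as allow and restrict. Please tell me what is wrong with the code and how to set these parameters. Any help would be much appreciated. Thanks in advance! Here is the code: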

import scrapy
import urllib
from scrapy.loader import ItemLoader
from quo.items import QuoItem

class MySpider(scrapy.Spider):
    name = 'quotes1'
    allowed_domains=['booking.com']


    def start_requests(self):
        yield scrapy.Request('https://www.booking.com/hotel/us/jolly-madison-towers.html?label=gen173nr-1DCAEoggJCAlhYSDNiBW5vcmVmaLUBiAEBmAExwgEKd2luZG93cyAxMMgBDNgBA-gBAfgBApICAXmoAgM;sid=a4397d59763c8a4caec92e6242e50b46;checkin=2017-07-05;checkout=2017-07-06;ucfs=1;soh=1;highlighted_blocks=;all_sr_blocks=;room1=A%2CA;soldout=0%2C0;hpos=2;dest_type=city;dest_id=20088325;srfid=4cc7afd646ecfd94fcc083a01cd786310c221d4eX2;from=searchresults;highlight_room=#hotelTmpl', self.parse)
    rules = (

    )

    def parse(self, response):
        reviewsurl = response.xpath('//a[@class="show_all_reviews_btn"]/@href')
        url = response.urljoin(reviewsurl[0].extract())
        self.pageNumber = 1
        yield scrapy.Request(url, callback=self.parse_content)


    def parse_content(self, response):
        item = QuoItem()
        user_rating = response.xpath('//span[@itemprop="reviewRating"]/meta[@itemprop="ratingValue"]/@content').extract()
        if user_rating:
            item['User_Rating'] = user_rating
        title = response.xpath('//*[@class="review_item_header_content_container"]/a/span[@itemprop="name"]').extract()
        if title:
            item['Title'] = title
        rev_positive = response.xpath('//p[@class="review_pos"]/span[@itemprop="reviewBody"]/text()').extract()
        if rev_positive:
            item['rev_positive'] = rev_positive
        rev_negative = response.xpath('//p[@class="review_neg"]/span[@itemprop="reviewBody"]/text()').extract()
        if rev_positive:
            item['rev_negative'] = rev_negative
            return item

        next_page = response.xpath('//a[@id="review_next_page_link"]/@href').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse_content)

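This is the output when I run the spider with the next_page code added: "Output when I add next_page code". As you can see in the log, the spider reaches the review page but does not scrape anything. This is the output when I remove the next_page code: "Output when I remove the next_page". In that case it reaches the review page and scrapes as instructed.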
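
One likely explanation for the difference between the two logs, offered as a sketch rather than a confirmed diagnosis: once `parse_content` contains a `yield`, Python treats the whole function as a generator, and on Python 3 a `return item` inside a generator simply ends it without handing the item to Scrapy. The stand-in functions below use plain dicts and strings instead of Scrapy objects (the names `broken_callback` and `fixed_callback` are hypothetical) to show the effect in isolation.

```python
def broken_callback(reviews):
    """Mirrors the shape of parse_content: return and yield in one function."""
    item = {}
    if reviews:
        item["rev_positive"] = reviews
        return item  # ends the generator; the item is never emitted
    yield "request for next page"


def fixed_callback(reviews):
    """Same logic, but the item is yielded so the caller receives it."""
    item = {}
    if reviews:
        item["rev_positive"] = reviews
    yield item
    yield "request for next page"


broken_output = list(broken_callback(["Great location"]))
fixed_output = list(fixed_callback(["Great location"]))
print(broken_output)  # []
print(fixed_output)
```

Applied to the spider, this would mean replacing `return item` with a `yield item` placed at the top indentation level of `parse_content`, before the next-page lookup. The last condition probably should also test `rev_negative` instead of repeating `rev_positive`.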

0 answers:

No answers