Unable to retrieve the next page link with Scrapy and Python when the page uses JavaScript

Asked: 2018-10-09 04:07:39

Tags: javascript python pagination scrapy

I'm having trouble getting the next page link with Python.

Code

import scrapy
from scrapy.http import Request
from gharbheti.items import GharbhetiItem
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader.processor import TakeFirst, Identity, MapCompose, Join, Compose
from urllib.parse import urljoin

class ListSpider(scrapy.Spider):
    name = 'list'
    allowed_domains = ['gharbheti.com']
    start_urls = ['https://www.gharbheti.com/sale', 'https://www.gharbheti.com/rent']

    def parse(self, response):
        # Follow the link of every property listed on the page
        properties = response.xpath('//li[@class="col-md-6 Search_building"]/descendant::a')
        for property in properties:
            link = property.xpath('./@href').extract_first()
            urls = response.urljoin(link)

            yield Request(urls, callback=self.parse_property, meta={'URL': urls})

    def parse_property(self, response):
        # Populate one item per property detail page
        l = ItemLoader(item=GharbhetiItem(), response=response)
        URL = response.meta.get('URL')
        l.add_value('URL', response.url)
        l.add_xpath('Title', '//div[@class="product-page-meta"]/h4/em/text()', MapCompose(str.strip, str.title))
        l.add_xpath('Offering', '//figcaption[contains(text(), "For Sale")]/text()|//figcaption[contains(text(),"For Rent")]/text()', MapCompose(lambda i: i.replace('For', ''), str.strip))
        l.add_xpath('Price', '//div[@class="deal-pricebox"]/descendant::h3/text()', MapCompose(str.strip))
        l.add_xpath('Type', '//ul[@class="suitable-for"]/li/text()', MapCompose(str.strip))
        bike_parking = response.xpath('//i[@class="fa fa-motorcycle"]/following-sibling::em/text()').extract_first()
        car_parking = response.xpath('//i[@class="fa fa-car"]/following-sibling::em/text()').extract_first()
        parking = "Bike Parking: {} Car Parking: {}".format(bike_parking, car_parking)
        l.add_value('Parking', parking)
        l.add_xpath('Description', '//div[@class="comment more"]/text()', MapCompose(str.strip))
        l.add_xpath('Bedroom', '//i[@class="fa fa-bed"]/following-sibling::text()', MapCompose(lambda i: i.replace('Total Bed Room:', ''), str.strip, int))
        l.add_xpath('Livingroom', '//i[@class="fa fa-inbox"]/following-sibling::text()', MapCompose(lambda i: i.replace('Total Living Room:', ''), str.strip, int))
        l.add_xpath('Kitchen', '//i[@class="fa fa-cutlery"]/following-sibling::text()', MapCompose(lambda i: i.replace('Total kitchen Room:', ''), str.strip, int))
        l.add_xpath('Bathroom', '//i[@class="fa fa-puzzle-piece"]/following-sibling::text()', MapCompose(lambda i: i.replace('Total Toilet/Bathroom:', ''), str.strip, int))
        l.add_xpath('Address', '//b[contains(text(), "Map")]/text()', MapCompose(lambda i: i.replace('Map Loaction :-', ''), str.strip))
        l.add_xpath('Features', '//div[@class="list main-list"]/ul/li/text()', MapCompose(str.strip))

        # Build absolute image URLs from the carousel <img> tags
        images = response.xpath('//div[@class="carousel-inner dtl-carousel-inner text-center"]/descendant::img').extract()
        images = [s.replace('<img src="', '') for s in images]
        images = [i.split('?')[0] for i in images]
        Image = ["http://www.gharbheti.com" + im for im in images]
        l.add_value('Images', Image)

        return l.load_item()

I can't retrieve the next page from this site. For another website (simple pagination without JavaScript), this is what I did:

next_page = response.urljoin(response.xpath('//a[contains(text(), "Next")]/@href').extract_first())

yield Request(next_page, callback=self.parse)

1 Answer:

Answer (score: 1)

Since the pagination is done with JavaScript, the link is not in the page's source code.

To see what's going on:

  1. Open your browser's inspector (F12 in Chrome) and go to the "Network" tab
  2. Click the "Load More" button on the page

The inspector will show you that the site sends an asynchronous POST form request to https://www.gharbheti.com/RoomRentHome/GetPropertiesForRent with two values in the form data (a quick way to replay that request is sketched after this list):

  1. RentTypeId: 0 {not sure what this is, but I'm sure you can figure it out if you need to}
  2. page: 1 {increments each time you click "Load More"}
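
As a quick check (my own sketch, not part of the original answer), you can replay that request with the requests library and look at what comes back; the RentTypeId and page fields are exactly the ones listed above.

import requests

url = 'https://www.gharbheti.com/RoomRentHome/GetPropertiesForRent'
# Same form data the "Load More" button sends; bump "page" to fetch later batches.
resp = requests.post(url, data={'RentTypeId': '0', 'page': '1'})
print(resp.status_code)
print(resp.text[:500])  # see whether it's an HTML fragment or JSON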

You'll have to take a programmatic approach using Scrapy's FormRequest. It looks like each page yields 10 or so properties, so if you want the next 1000 properties after the initial page load, you could write:

for i in range(1,101):
    <send a form request with i as the page value>

I assume the data returned from that POST is in a different format than the site's main page, so you'll probably have to define another callback function to parse it.
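
To make that concrete, here is a minimal sketch of a spider that drives the endpoint with FormRequest. It is an illustration, not the answerer's code: the spider name, the 101-page range, and the listing selector in parse_api are assumptions you'd adjust after seeing what the endpoint actually returns.

import scrapy
from scrapy.http import FormRequest, Request

class RentApiSpider(scrapy.Spider):
    # Hypothetical standalone spider; you could fold this into your existing ListSpider instead.
    name = 'rent_api'
    allowed_domains = ['gharbheti.com']
    api_url = 'https://www.gharbheti.com/RoomRentHome/GetPropertiesForRent'

    def start_requests(self):
        # Replay the "Load More" POST for pages 1..100; formdata values must be strings.
        for i in range(1, 101):
            yield FormRequest(self.api_url,
                              formdata={'RentTypeId': '0', 'page': str(i)},
                              callback=self.parse_api)

    def parse_api(self, response):
        # Assumes the endpoint returns an HTML fragment using the same listing markup
        # as the main page; change the selector once you know the real response format.
        for link in response.xpath('//li[@class="col-md-6 Search_building"]/descendant::a/@href').extract():
            yield Request(response.urljoin(link), callback=self.parse_property)

    def parse_property(self, response):
        # Reuse the parse_property logic from the question here.
        pass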