How to scrape the next-page links on Alibaba

Time: 2018-08-19 02:12:53

Tags: python xpath web-scraping scrapy

My code for scraping data from a single Alibaba page looks like this:

# -*- coding: utf-8 -*-
import scrapy


class AlibotSpider(scrapy.Spider):
    name = 'alibot'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/showroom/acrylic-wine-box.html']

    def parse(self, response):
        Title = response.xpath('//*[@class="title three-line"]/a/@title').extract()
        Price = response.xpath('//div[@class="price"]/b/text()').extract()
        Min_order = response.xpath('//div[@class="min-order"]/b/text()').extract()
        Response_rate = response.xpath('//i[@class="ui2-icon ui2-icon-skip"]/text()').extract()

        for item in zip(Title, Price, Min_order, Response_rate):
            scraped_info = {
                'Title': item[0],
                'Price': item[1],
                'Min_order': item[2],
                'Response_rate': item[3],
            }
            yield scraped_info

I want to scrape data from all the pages. How can I do that, given that clicking "Next" triggers a JavaScript action? I also have many such links, not just this one, so I need some way to iterate through the next pages until the last one and scrape the data from each of them.

The HTML snippet for the pagination is this:

<div class="ui2-pagination-pages">
  <span class="prev disable">Prev</span>
  <span class="current">1</span>
  <a rel="nofollow" href="//www.alibaba.com/showroom/acrylic-wine-box_2.html">2</a>
  <a rel="nofollow" href="//www.alibaba.com/showroom/acrylic-wine-box_3.html">3</a>
  <a rel="nofollow" href="//www.alibaba.com/showroom/acrylic-wine-box_4.html">4</a>
  <a rel="nofollow" href="//www.alibaba.com/showroom/acrylic-wine-box_5.html">5</a>
  <a rel="nofollow" href="//www.alibaba.com/showroom/acrylic-wine-box_6.html">6</a>
  <a rel="nofollow" href="//www.alibaba.com/showroom/acrylic-wine-box_7.html">7</a>
  <span class="interim">...</span>
  <a rel="nofollow" href="//www.alibaba.com/showroom/acrylic-wine-box_103.html">103</a>
  <a href="javascript:void(0)" class="next" data-role="next">Next</a>
</div>
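
For reference, only the "Next" button is a JavaScript link (javascript:void(0)); the numbered page anchors carry plain hrefs. A minimal sketch (assuming the markup above, e.g. tried in scrapy shell) for pulling those URLs out:

# assumes the pagination div shown above; hrefs are protocol-relative ("//www.alibaba.com/...")
page_links = response.xpath('//div[@class="ui2-pagination-pages"]/a[@rel="nofollow"]/@href').extract()
page_urls = [response.urljoin(href) for href in page_links]  # absolute URLs for pages 2..103

But I am not sure how to wire this into the spider so that it keeps following pages until the last one, for every showroom link I have.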

1 Answer:

Answer 0 (Score: 1)

All the addresses follow the pattern https://www.alibaba.com/showroom/acrylic-wine-box_(page).html, so you can vary the page field to scrape the different pages.

# OUTPUT:
# 2018-08-19 11:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.alibaba.com/showroom/acrylic-wine-box_2.html> (referer: None)
# 2018-08-19 11:28:33 [alibot] INFO: counter clear custom plexiglass display box for wine red wine champagne display acrylic box -  US $4-50
# 2018-08-19 11:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.alibaba.com/showroom/acrylic-wine-box_1.html> (referer: None)
# 2018-08-19 11:28:34 [alibot] INFO: wine glass acrylic whisky beverage led bottle display rack box stands -  US $16.88-26.88

# -*- coding: utf-8 -*-
import scrapy


class AlibotSpider(scrapy.Spider):
    name = 'alibot'
    allowed_domains = ['alibaba.com']
    start_urls = [
        "https://www.alibaba.com/showroom/acrylic-wine-box_" + str(x) + ".html"
        for x in range(1, 3)  # pages to scrape
    ]

    def parse(self, response):
        Title = response.xpath('//*[@class="title three-line"]/a/@title').extract()
        Price = response.xpath('//div[@class="price"]/b/text()').extract()
        Min_order = response.xpath('//div[@class="min-order"]/b/text()').extract()
        Response_rate = response.xpath('//i[@class="ui2-icon ui2-icon-skip"]/text()').extract()

        # log one sample item per page to show each page was actually crawled
        self.logger.info(Title[0] + " - " + Price[0])
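
The page count differs per showroom, so hard-coding the range will not generalize to your other links. An alternative, sketched below under the assumption that the ui2-pagination-pages block from your question appears on every listing page, is to follow the numbered pagination anchors from each response and let Scrapy's built-in duplicate filter skip pages it has already seen:

# -*- coding: utf-8 -*-
# Alternative sketch (assumes the pagination markup from the question is present
# on every listing page): follow the numbered page anchors instead of a fixed range.
import scrapy


class AlibotSpider(scrapy.Spider):
    name = 'alibot'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/showroom/acrylic-wine-box.html']

    def parse(self, response):
        Title = response.xpath('//*[@class="title three-line"]/a/@title').extract()
        Price = response.xpath('//div[@class="price"]/b/text()').extract()
        Min_order = response.xpath('//div[@class="min-order"]/b/text()').extract()
        Response_rate = response.xpath('//i[@class="ui2-icon ui2-icon-skip"]/text()').extract()

        for item in zip(Title, Price, Min_order, Response_rate):
            yield {
                'Title': item[0],
                'Price': item[1],
                'Min_order': item[2],
                'Response_rate': item[3],
            }

        # The "Next" button is javascript:void(0), but the numbered page anchors
        # carry real hrefs; response.follow resolves the protocol-relative URLs,
        # and Scrapy's duplicate filter drops pages that were already requested.
        for href in response.xpath('//div[@class="ui2-pagination-pages"]/a[@rel="nofollow"]/@href').extract():
            yield response.follow(href, callback=self.parse)

With this approach you only need one start URL per showroom, and the spider stops on its own once every numbered page has been visited.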