This is my code for scraping data from a single page on Alibaba:
# -*- coding: utf-8 -*-
import scrapy


class AlibotSpider(scrapy.Spider):
    name = 'alibot'
    allowed_domains = ['alibaba.com']
    start_urls = ['https://www.alibaba.com/showroom/acrylic-wine-box.html']

    def parse(self, response):
        Title = response.xpath('//*[@class="title three-line"]/a/@title').extract()
        Price = response.xpath('//div[@class="price"]/b/text()').extract()
        Min_order = response.xpath('//div[@class="min-order"]/b/text()').extract()
        Response_rate = response.xpath('//i[@class="ui2-icon ui2-icon-skip"]/text()').extract()
        for item in zip(Title, Price, Min_order, Response_rate):
            scraped_info = {
                'Title': item[0],
                'Price': item[1],
                'Min_order': item[2],
                'Response_rate': item[3]
            }
            yield scraped_info
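I want to scrape data from all of the pages. How can I do that, given that clicking the Next link triggers a JavaScript action? I have multiple links, not just this one, so I need some way to step through the pages up to the last one and scrape the data from each.
Here is the HTML snippet: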
<div class="ui2-pagination-pages">
<span class="prev disable">Prev</span>
<span class="current">1</span>
<a rel="nofollow" href="//www.alibaba.com/showroom/acrylic-wine-box_2.html">2</a>
<a rel="nofollow" href="//www.alibaba.com/showroom/acrylic-wine-box_3.html">3</a>
<a rel="nofollow" href="//www.alibaba.com/showroom/acrylic-wine-box_4.html">4</a>
<a rel="nofollow" href="//www.alibaba.com/showroom/acrylic-wine-box_5.html">5</a>
<a rel="nofollow" href="//www.alibaba.com/showroom/acrylic-wine-box_6.html">6</a>
<a rel="nofollow" href="//www.alibaba.com/showroom/acrylic-wine-box_7.html">7</a>
<span class="interim">...</span>
<a rel="nofollow" href="//www.alibaba.com/showroom/acrylic-wine-box_103.html">103</a>
<a href="javascript:void(0)" class="next" data-role="next">Next</a>
</div>
Answer 0 (score: 1)
All of the page addresses follow the pattern https://www.alibaba.com/showroom/acrylic-wine-box_(page).html, so you can change the page field to scrape different pages.
# -*- coding: utf-8 -*-
import scrapy


class AlibotSpider(scrapy.Spider):
    name = 'alibot'
    allowed_domains = ['alibaba.com']
    start_urls = [
        "https://www.alibaba.com/showroom/acrylic-wine-box_" + str(x) + ".html"
        for x in range(1, 3)  # pages 1 and 2; raise the upper bound to scrape more
    ]

    def parse(self, response):
        Title = response.xpath('//*[@class="title three-line"]/a/@title').extract()
        Price = response.xpath('//div[@class="price"]/b/text()').extract()
        Min_order = response.xpath('//div[@class="min-order"]/b/text()').extract()
        Response_rate = response.xpath('//i[@class="ui2-icon ui2-icon-skip"]/text()').extract()
        # Log the first product on the page; skip pages where nothing matched
        if Title and Price:
            self.logger.info(Title[0] + " - " + Price[0])

# OUTPUT:
# 2018-08-19 11:28:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.alibaba.com/showroom/acrylic-wine-box_2.html> (referer: None)
# 2018-08-19 11:28:33 [alibot] INFO: counter clear custom plexiglass display box for wine red wine champagne display acrylic box - US $4-50
# 2018-08-19 11:28:34 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.alibaba.com/showroom/acrylic-wine-box_1.html> (referer: None)
# 2018-08-19 11:28:34 [alibot] INFO: wine glass acrylic whisky beverage led bottle display rack box stands - US $16.88-26.88
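Rather than hard-coding the page range, the last page number could be read from the pagination links shown in the question (the final numbered link there points to page 103). The following is only a sketch using the standard library; `all_page_urls` and `PAGE_RE` are hypothetical names, not part of Scrapy or of the code above, and it assumes the first-page URL ends in `.html` and that the other pages use the `_N.html` suffix seen in the snippet.

```python
import re

# Hypothetical helper: matches the "_<page>.html" suffix seen in the
# pagination links of the question's HTML snippet.
PAGE_RE = re.compile(r"_(\d+)\.html$")


def all_page_urls(first_page_url, hrefs):
    """Return URLs for pages 1..N, where N is the largest page number linked.

    Assumes first_page_url ends in ".html" and that later pages insert
    "_<page>" before that extension.
    """
    last_page = 1
    for href in hrefs:
        match = PAGE_RE.search(href)
        if match:  # ignores links such as "javascript:void(0)"
            last_page = max(last_page, int(match.group(1)))
    base = first_page_url[: -len(".html")]
    return [first_page_url] + [f"{base}_{n}.html" for n in range(2, last_page + 1)]


# href values copied from the pagination block in the question
hrefs = [
    "//www.alibaba.com/showroom/acrylic-wine-box_2.html",
    "//www.alibaba.com/showroom/acrylic-wine-box_103.html",
    "javascript:void(0)",
]
urls = all_page_urls(
    "https://www.alibaba.com/showroom/acrylic-wine-box.html", hrefs
)
print(len(urls))  # 103
print(urls[-1])   # https://www.alibaba.com/showroom/acrylic-wine-box_103.html
```

Inside a spider, the hrefs could probably be collected with response.xpath('//div[@class="ui2-pagination-pages"]/a/@href').extract(), and each generated URL yielded as scrapy.Request(url, callback=self.parse). This only works if the pagination block is present in the HTML that Scrapy downloads, which should be checked for each showroom link.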