My Scrapy code is:
import scrapy


class DummymartSpider(scrapy.Spider):
    name = 'dummymart'
    allowed_domains = ['www.dummymart.com/product']
    start_urls = ['https://www.dummymart.net/product/auto-parts--118']

    def parse(self, response):
        Company = response.xpath('//*[@class="word-wrap item-title"]/text()').extract()
        for item in zip(Company):
            scraped_info = {
                'Company':item[0],
            }
            yield scraped_info
        next_page_url = response.css('li >a::attr(href)').extract_first()
        #next_page_url = response.urljoin(next_page_url)
        if next_page_url:
            yield scrapy.Request(url = next_page_url, callback = self.parse)
The pagination links have the following HTML:
<ul class="pagination">
<li class="active"><a href="#">1 <span class="sr-only">(current)</span></a></li>
<li><a href="https://www.dummy.net/product/auto-parts--118?page=2">2</a></li>
<li><a href="https://www.dummy.net/product/auto-parts--118?page=3">3</a></li>
<li><a href="https://www.dummy.net/product/auto-parts--118?page=4">4</a></li>
<li><a href="https://www.dummy.net/product/auto-parts--118?page=5">5</a></li>
<li><a href="https://www.dummy.net/product/auto-parts--118?page=6">6</a></li>
<li><a href="https://www.dummy.net/product/auto-parts--118?page=7">7</a></li>
<li><a href="https://www.dummy.net/product/auto-parts--118?page=8">8</a></li>
<li><a href="https://www.dummy.net/product/auto-parts--118?page=9">9</a></li>
<li><a href="https://www.dummy.net/product/auto-parts--118?page=10">10</a></li>
<li class="disabled"><span>...</span></li>
<li><a href="https://www.dummy.net/product/auto-parts--118?page=2" aria-label="Next"><span aria-hidden="true">»</span></a></li>
</ul>
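The problem is that it only scrapes the first pagination link and not the others. How can I also scrape through the remaining pagination links? Thanks.
The pagination HTML when the second page is active: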
<ul class="pagination">
<li><a href="https://www.dummy.net/products/new?page=1" aria-label="Prev"><span aria-hidden="true">«</span></a></li>
<li><a href="https://www.dummy.net/products/new?page=1">1</a></li>
<li class="active"><a href="#">2 <span class="sr-only">(current)</span></a></li>
<li><a href="https://www.dummy.net/products/new?page=3">3</a></li>
<li><a href="https://www.dummy.net/products/new?page=4">4</a></li>
<li><a href="https://www.dummy.net/products/new?page=5">5</a></li>
<li><a href="https://www.dummy.net/products/new?page=6">6</a></li>
<li><a href="https://www.dummy.net/products/new?page=7">7</a></li>
<li><a href="https://www.dummy.net/products/new?page=8">8</a></li>
<li><a href="https://www.dummy.net/products/new?page=9">9</a></li>
<li><a href="https://www.dummy.net/products/new?page=10">10</a></li>
<li class="disabled"><span>...</span></li>
<li><a href="https://www.dummy.net/products/new?page=3" aria-label="Next"><span aria-hidden="true">»</span></a></li>
</ul>
Answer 0 (score: 0)
You can try this approach (I'm trying to find the link that follows the current page):
next_page_url = response.xpath('//li[ ./a[@class="curr"] ]/following-sibling::li[1]/a/@href').extract_first()
#next_page_url = response.urljoin(next_page_url)
if next_page_url:
    yield scrapy.Request(url = next_page_url, callback = self.parse)
UPDATE: Based on your new HTML, you need the following code:
next_page_url = response.xpath('//li/a[@aria-label="Next"]/@href').extract_first()
if next_page_url:
    yield scrapy.Request(url = next_page_url, callback = self.parse)
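To see why the "Next" selector helps, here is a small offline check you could run against trimmed copies of the two pagination snippets from the question. It is only a sketch: it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's response.xpath (ElementTree supports the small XPath subset needed here), and the helper names first_li_link and next_link are made up for illustration. On the real page, 'li > a' could match an even earlier link outside the pagination block.

```python
import xml.etree.ElementTree as ET

# Trimmed copies of the two pagination snippets from the question.
PAGE_1 = """<ul class="pagination">
<li class="active"><a href="#">1 <span class="sr-only">(current)</span></a></li>
<li><a href="https://www.dummy.net/product/auto-parts--118?page=2">2</a></li>
<li><a href="https://www.dummy.net/product/auto-parts--118?page=3">3</a></li>
<li class="disabled"><span>...</span></li>
<li><a href="https://www.dummy.net/product/auto-parts--118?page=2" aria-label="Next"><span aria-hidden="true">»</span></a></li>
</ul>"""

PAGE_2 = """<ul class="pagination">
<li><a href="https://www.dummy.net/products/new?page=1" aria-label="Prev"><span aria-hidden="true">«</span></a></li>
<li><a href="https://www.dummy.net/products/new?page=1">1</a></li>
<li class="active"><a href="#">2 <span class="sr-only">(current)</span></a></li>
<li><a href="https://www.dummy.net/products/new?page=3">3</a></li>
<li class="disabled"><span>...</span></li>
<li><a href="https://www.dummy.net/products/new?page=3" aria-label="Next"><span aria-hidden="true">»</span></a></li>
</ul>"""


def first_li_link(markup):
    """Approximates 'li > a' + extract_first(): href of the first <li><a>."""
    link = ET.fromstring(markup).find(".//li/a")
    return link.get("href") if link is not None else None


def next_link(markup):
    """Href of the <a aria-label="Next"> link, or None if there is none."""
    link = ET.fromstring(markup).find(".//li/a[@aria-label='Next']")
    return link.get("href") if link is not None else None


if __name__ == "__main__":
    for label, markup in (("page 1", PAGE_1), ("page 2", PAGE_2)):
        print(label, "first li > a:", first_li_link(markup))
        print(label, "Next link:   ", next_link(markup))
```

In these snippets the first li > a link is "#" on page 1 and the "Prev" link back to page 1 on page 2, so following it never moves the crawl forward, while the aria-label="Next" link always points at the following page. When no "Next" link exists, next_link returns None and the `if next_page_url:` check stops the crawl.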