Question

I am using Scrapy 1.4.0 to scrape this site: https://www.olx.com.ar/celulares-telefonos-cat-831. When I run the spider, everything goes well until it reaches the "next page" part. Here is the code:

# -*- coding: utf-8 -*-
import scrapy
#import time

class OlxarSpider(scrapy.Spider):
    name = "olxar"
    allowed_domains = ["olx.com.ar"]
    start_urls = ['https://www.olx.com.ar/celulares-telefonos-cat-831']

    def parse(self, response):
        #time.sleep(10)
        response = response.replace(body=response.body.replace('<br>', ''))
        SET_SELECTOR = '.item'
        for item in response.css(SET_SELECTOR):
            PRODUCTO_SELECTOR = '.items-info h3 ::text'
            yield {
                'producto': item.css(PRODUCTO_SELECTOR).extract_first().replace(',',' '),
            }

        NEXT_PAGE_SELECTOR = '.items-paginations-buttons a::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first().replace('//','https://')
        if next_page:
            yield scrapy.Request(response.urljoin(next_page),
                callback=self.parse
            )

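I have seen in other questions that some people add the dont_filter=True argument to the Request, but that does not work for me. It just makes the spider loop over the first two pages. I added the replace('//','https://') part to fix the original href, which comes without https: and which Scrapy could not follow. Also, when I run the spider, it scrapes the first page and then returns [scrapy.dupefilters] DEBUG: Filtered duplicate request: <GET https://www.olx.com.ar/celulares-telefonos-cat-831-p-2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates). Why is it filtering the second page as a duplicate when it apparently is not one?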

I applied Tarun Lalwani's solution from the comments. I had missed such a small detail! After correcting it, everything works fine. Thanks!

Answer 1

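The problem is your CSS selector. On page 1 it matches only the next-page link. On page 2 it matches both the previous-page and the next-page links. Because you take the first match with extract_first(), you just keep bouncing between the first and second pages.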

The fix is simple: change the CSS selector from

NEXT_PAGE_SELECTOR = '.items-paginations-buttons a::attr(href)'

to

NEXT_PAGE_SELECTOR = '.items-paginations-buttons a.next::attr(href)'

This matches only the next-page URL.

Scrapy is not following the next-page URL. Why?
