Question

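Hello, and thank you for taking the time to help.

Problem: I am trying to chain my spider so that it keeps traversing to the next page, but it does not work, so I would appreciate some pointers on what I am doing wrong.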

import scrapy


class infoSpider(scrapy.Spider):
    name = 'info_spider'
    start_urls = ['https://www.youtube.com/results?search_query=cars']

    def parse(self, response):
        # Iterate over every result block matched by SET_SELECTOR.
        SET_SELECTOR = '.yt-lockup'
        for content in response.css(SET_SELECTOR):
            NAME_SELECTOR = '.yt-lockup-byline a ::text'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': content.css(NAME_SELECTOR).extract_first(),
                'image': content.css(IMAGE_SELECTOR).extract_first(),
            }

        # Follow the "next page" link, if one was found.
        NEXT_PAGE_SELECTOR = '.yt-uix-button-content a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

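Update: it returns some results, but it does not continue on to the next page. I found that the site gives each "next" button a random key, so I need to find a workaround for that.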

Please let me know if you need any more information.

Thanks in advance.

Answer 1

You will want to use a Link Extractor. You can specify the next-page links with a set of rules. Here is the official documentation: https://doc.scrapy.org/en/latest/topics/link-extractors.html

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


# rules are only processed by CrawlSpider, not by the plain scrapy.Spider.
class infoSpider(CrawlSpider):
    name = 'info_spider'
    start_urls = ['https://www.youtube.com/results?search_query=cars']

    rules = (
        # restrict_css takes a plain CSS selector for the region that holds
        # the links, without the ::attr(href) pseudo-element.
        Rule(LinkExtractor(allow=(), restrict_css='.yt-uix-button-content a'),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        SET_SELECTOR = '.yt-lockup'
        for content in response.css(SET_SELECTOR):
            NAME_SELECTOR = '.yt-lockup-byline a ::text'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': content.css(NAME_SELECTOR).extract_first(),
                'image': content.css(IMAGE_SELECTOR).extract_first(),
            }

Answer 2

Change

NEXT_PAGE_SELECTOR = '.yt-uix-button-content a ::attr(href)'

to

NEXT_PAGE_SELECTOR = '.yt-uix-button-content a::attr(href)'

or change the last lines of the code to

try:
    next_page = response.css('.yt-uix-button-content a::attr(href)').extract()[0]

    yield scrapy.Request(
        response.urljoin(next_page),
        callback=self.parse
    )
except IndexError:
    # No next-page link was found on this page, so stop paginating.
    pass
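The two steps in that snippet (take the first href if there is one, then make it absolute) can be tried outside Scrapy. This is a rough sketch using only the standard library's `urljoin`, which `response.urljoin` relies on; the `first_or_none` and `build_next_url` helpers and the href value are hypothetical, chosen just for illustration.

```python
from urllib.parse import urljoin


def first_or_none(values):
    """Return the first item of a list, or None when the list is empty."""
    return values[0] if values else None


def build_next_url(page_url, hrefs):
    """Resolve the first extracted href against the page URL, if any."""
    href = first_or_none(hrefs)
    if href is None:
        return None  # Nothing to follow: the crawl stops here.
    return urljoin(page_url, href)


page_url = 'https://www.youtube.com/results?search_query=cars'
# Hypothetical href, standing in for whatever the selector extracts.
found = build_next_url(page_url, ['/results?search_query=cars&page=2'])
missing = build_next_url(page_url, [])
print(found, missing)
```

Checking for an empty list up front has the same effect as catching IndexError, and it is what extract_first() does when it returns None for no match.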

Answer 3

I suggest you use rules:

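rules = (
    Rule(
        LinkExtractor(
            restrict_xpaths='//*[contains(@class, "yt-uix-button-content")]/a'),
        # The callback is the name of a method you define on the spider;
        # it must not be parse.
        callback='parse_page',
        # follow defaults to False once a callback is set, so enable it
        # to keep paginating.
        follow=True),
)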

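Another suggestion: do not override the parse method. CrawlSpider uses parse internally to apply its rules, so overriding it stops the rules from working.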

How do I traverse to the next page with Scrapy?