How do I crawl to the next page with Scrapy?

Date: 2017-03-04 19:28:32

Tags: python scrapy web-crawler scrapy-spider

Hello, and thank you for taking the time to help.

Problem: I am trying to chain my spider so it keeps crawling to the next page, but it isn't working, so I would appreciate some pointers on what I'm doing wrong.

import scrapy

class infoSpider(scrapy.Spider):
    name = 'info_spider'
    start_urls = ['https://www.youtube.com/results?search_query=cars']

    def parse(self, response):
        SET_SELECTOR = '.yt-lockup'
        for content in response.css(SET_SELECTOR):
            NAME_SELECTOR = '.yt-lockup-byline a ::text'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': content.css(NAME_SELECTOR).extract_first(),
                'image': content.css(IMAGE_SELECTOR).extract_first(),
            }

        NEXT_PAGE_SELECTOR = '.yt-uix-button-content a ::attr(href)'
        next_page = response.css(NEXT_PAGE_SELECTOR).extract_first()
        if next_page:
            yield scrapy.Request(
                response.urljoin(next_page),
                callback=self.parse
            )

Update: it returns some results, but it does not continue crawling to the next page. I found that the site gives each Next button a random key, and I need to find a way around that.

If you need more information, please let me know.

Thanks in advance.

3 Answers:

Answer 0 (score: 2)

You will want to use a Link Extractor. You can specify the next-page link with a rule set; note that rules are only processed by CrawlSpider subclasses, not the base scrapy.Spider. Here is the official documentation: https://doc.scrapy.org/en/latest/topics/link-extractors.html

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class infoSpider(CrawlSpider):  # rules require CrawlSpider, not scrapy.Spider
    name = 'info_spider'
    start_urls = ['https://www.youtube.com/results?search_query=cars']

    rules = (
        # restrict_css takes a plain CSS selector, without ::attr(href)
        Rule(LinkExtractor(restrict_css='.yt-uix-button-content a'),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        SET_SELECTOR = '.yt-lockup'
        for content in response.css(SET_SELECTOR):
            NAME_SELECTOR = '.yt-lockup-byline a ::text'
            IMAGE_SELECTOR = 'img ::attr(src)'
            yield {
                'name': content.css(NAME_SELECTOR).extract_first(),
                'image': content.css(IMAGE_SELECTOR).extract_first(),
            }

Answer 1 (score: 0)

Change

NEXT_PAGE_SELECTOR = '.yt-uix-button-content a ::attr(href)'

to

NEXT_PAGE_SELECTOR = '.yt-uix-button-content a::attr(href)'

Or change the last lines of your code to

try:
    next_page = response.css('.yt-uix-button-content a::attr(href)').extract()[0]

    yield scrapy.Request(
        response.urljoin(next_page),
        callback=self.parse
    )
except IndexError:
    pass

Answer 2 (score: 0)

I suggest you use rules.

rules = (
    Rule(
        LinkExtractor(restrict_xpaths='//*[contains(@class, "yt-uix-button-content")]/a'),
        # pass the callback method's name as a plain string
        # (e.g. a parse_page method), not 'self.parse'
        callback='parse_page',
    ),
)

One more suggestion: do not override the parse method, since CrawlSpider uses it internally to implement its crawling logic.