Best way to get a Scrapy web crawler to follow links

Posted: 2017-11-06 00:29:10

Tags: python scrapy web-crawler

So I'm trying to write a spider that keeps clicking the "next" button on a web page until the button no longer exists (or until I add some logic to make it stop). The code below correctly extracts the link to the next page, but it only prints it once. My question is: why doesn't it "follow" the link that each next button points to?

import scrapy
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector  # older Scrapy import for the deprecated selector used below


class MyprojectSpider(scrapy.Spider):
    name = 'redditbot'
    allowed_domains = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']
    start_urls = ['https://www.reddit.com/r/nfl/?count=25&after=t3_7ax8lb']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        next_page = hxs.select('//div[@class="nav-buttons"]//a/@href').extract()
        if next_page:
            yield Request(next_page[1], self.parse)
            print(next_page[1])
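
(Side note: HtmlXPathSelector and its select() method belong to the old, now-deprecated Scrapy selector API; current Scrapy lets you run the same XPath directly on the response. A minimal equivalent sketch, assuming the same page structure as above:)

next_page = response.xpath('//div[@class="nav-buttons"]//a/@href').extract()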

1 Answer:

Answer 0 (score: 1)

To go to the next page, instead of just printing the link, you need to yield a scrapy.Request object, as in the code below:

import scrapy

class MyprojectSpider(scrapy.Spider):
    name = 'myproject'
    allowed_domains = ['reddit.com']
    start_urls = ['https://www.reddit.com/r/nfl/']

    def parse(self, response):
        posts = response.xpath('//div[@class="top-matter"]')
        for post in posts:
            # Get your data here
            title = post.xpath('p[@class="title"]/a/text()').extract()
            print(title)
        # Go to the next page (once per response, outside the post loop)
        next_page = response.xpath('//span[@class="next-button"]/a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Update: the previous code was wrong; it needed to use absolute URLs, and some of the XPaths were incorrect. This new version should work.
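
Since the update mentions absolute URLs: Scrapy 1.4 and later also provide response.follow, which resolves a relative URL against the current page for you, so the urljoin call can be dropped. A minimal sketch of that variant:

next_page = response.xpath('//span[@class="next-button"]/a/@href').extract_first()
if next_page:
    # response.follow builds the absolute URL from the relative href automatically
    yield response.follow(next_page, callback=self.parse)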

Hope it helps!
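
If you want to keep the scraped titles instead of just printing them, one option (not part of the original answer) is to yield plain dicts as items and let Scrapy's feed export write them out. A sketch, assuming the spider is saved in a standalone file named reddit_spider.py (hypothetical name):

import scrapy

class MyprojectSpider(scrapy.Spider):
    name = 'myproject'
    allowed_domains = ['reddit.com']
    start_urls = ['https://www.reddit.com/r/nfl/']

    def parse(self, response):
        for post in response.xpath('//div[@class="top-matter"]'):
            # Yield each title as an item instead of printing it
            yield {'title': post.xpath('p[@class="title"]/a/text()').extract_first()}
        # Follow the next-page link, if any
        next_page = response.xpath('//span[@class="next-button"]/a/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Running it with scrapy runspider reddit_spider.py -o titles.json writes the yielded dicts to titles.json.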