Question

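I am trying to have a Scrapy spider crawl multiple pages in an archive, with the goal of opening every individual link and scraping the contents of the linked page. I am running into some random HTTP 500 errors, which I am trying to skip by simply wrapping the request in a try/except so that pages returning a 500 are passed over.

The first part of the parse function iterates over the hrefs in the archive page so that each linked page is scraped with the parse_art function. The second part finds the next page in the archive and follows it to continue the crawl.

I am trying to change the program so that it iterates over the initial URL instead, but I cannot seem to get it right. Any help would be appreciated.

Running Scrapy on Python 3.7.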

import scrapy
url_number = 1

class SpiderOne(scrapy.Spider):
    name = 'spider1'
    start_urls = ["http://www.page2bscraped.com/archive?page=%d" % url_number]

    #Parses over the archive page
    def parse(self, response):
        global url_number
        for href in response.xpath(".//a/@href"):
            yield response.follow(href, self.parse_art)

        for href in response.xpath(start_url):
            yield response.follow(start_url, self.parse)
            url_number += 1

    #Parses page contents
    def parse_art(self, response):
        #code goes here
        pass

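I am trying to make the spider crawl the whole archive by taking the URL and adding 1 to the current archive page number, instead of relying on the (unreliable) "Next Page" XPath.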

Answer 1

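Changing the value of url_number does not change the value inside the URL. The string in start_urls was formatted once, when the class was defined, and it keeps no link to the variable.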
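This can be checked in plain Python, without Scrapy. The snippet below is only an illustration; the names start_url and rebuilt are made up for the demo and are not part of the spider.

```python
# The "%" formatting runs once, when this line executes; the result is a plain string.
url_number = 1
start_url = "http://www.page2bscraped.com/archive?page=%d" % url_number

# Incrementing the integer afterwards does not touch the string built above.
url_number += 1

print(start_url)  # still ends in page=1

# Formatting again with the new value produces the next page's URL.
rebuilt = "http://www.page2bscraped.com/archive?page=%d" % url_number
print(rebuilt)  # ends in page=2
```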

You have to generate the full URL again:

def parse(self, response):
    global url_number

    # Follow every link on the current archive page
    for href in response.xpath(".//a/@href"):
        yield response.follow(href, self.parse_art)

    # Build the URL of the next archive page from the incremented counter
    url_number += 1
    url = "http://www.page2bscraped.com/archive?page=%d" % url_number

    yield response.follow(url, self.parse)
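As a possible variation, the next URL could be derived from the URL of the page that was just parsed, so that no global counter is needed. The sketch below uses only the standard library; the helper name next_page_url and its param argument are hypothetical, and treating a missing page parameter as page 1 is an assumption about the site.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(url, param="page"):
    """Return `url` with its integer `param` query value increased by one."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # Assumption: a URL without the parameter is treated as page 1.
    current = int(query.get(param, ["1"])[0])
    query[param] = [str(current + 1)]
    # Rebuild the URL with the updated query string, leaving the rest untouched.
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


print(next_page_url("http://www.page2bscraped.com/archive?page=1"))
```

Inside parse, the last three lines of the method above could then become yield response.follow(next_page_url(response.url), self.parse). Neither version has an explicit stop condition, so a check such as "stop when the archive page contains no article links" is probably still needed.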

How to iterate through an archive using URLs in Scrapy?
