应用错误收集

以下Scrapy CrawlSpider类代码用于通过data.ok.gov页面中的以下分页来抓取链接。

class OklahomaFinanceSpider(CrawlSpider):
    name = "OklahomaFinanceSpider"
    allowed_domains = ["data.ok.gov"]
    start_urls = [
        "http://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"
        ] 

    rules = (
    Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//li[@class="pager-next"]',)), callback="parse_page", follow= True),
) 
def parse_page(self, response): 

        for href in response.xpath('//*[contains(concat(" ", normalize-space(@class), " "),"search-results apachesolr_search-results")]/h3/a/@href'):
        url = response.urljoin(href.extract())
        yield scrapy.Request(url, callback=self.parse_dir_contents)

但是，第一页没有被删除。我对规则犯了什么错误？

Scrapy正在跳过分页的第一页

0 个答案: