Scrapy is skipping the first page of pagination

Time: 2015-11-20 19:20:04

Tags: pagination scrapy scrapy-spider

The following Scrapy CrawlSpider class is meant to follow the pagination on a data.ok.gov page and scrape the links on each page.

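# Import paths assume Scrapy 1.x; the original post does not show its imports.
import scrapy
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class OklahomaFinanceSpider(CrawlSpider):
    name = "OklahomaFinanceSpider"
    allowed_domains = ["data.ok.gov"]
    start_urls = [
        "http://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"
    ]

    # Follow the "next" pager link and pass each followed page to parse_page.
    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//li[@class="pager-next"]',)),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # Queue a request for every dataset link in the search results.
        for href in response.xpath('//*[contains(concat(" ", normalize-space(@class), " "),"search-results apachesolr_search-results")]/h3/a/@href'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_dir_contents)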

However, the first page is not being scraped. What mistake did I make in the rule?

0 answers:

No answers