The following Scrapy CrawlSpider code is meant to scrape links by following
the pagination on data.ok.gov pages.

    class OklahomaFinanceSpider(CrawlSpider):
        name = "OklahomaFinanceSpider"
        allowed_domains = ["data.ok.gov"]
        start_urls = [
            "http://data.ok.gov/browse?f[0]=bundle_name%3ADataset&f[1]=im_field_categories%3A4191"
        ]

        rules = (
            Rule(
                SgmlLinkExtractor(allow=(), restrict_xpaths=('//li[@class="pager-next"]',)),
                callback="parse_page",
                follow=True,
            ),
        )

        def parse_page(self, response):
            for href in response.xpath('//*[contains(concat(" ", normalize-space(@class), " "),"search-results apachesolr_search-results")]/h3/a/@href'):
                url = response.urljoin(href.extract())
                yield scrapy.Request(url, callback=self.parse_dir_contents)

However, the first page is not being scraped. What mistake did I make in the rules?
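
One possible explanation, offered as a hedged sketch rather than a confirmed diagnosis: in CrawlSpider, responses for start_urls are passed to the parse_start_url hook, while a Rule's callback only receives pages reached by following extracted links, so the first page would never reach parse_page. The toy model below imitates that routing in plain Python without importing Scrapy. ToyCrawlSpider, handle_start_response, handle_followed_response and the dictionary-based responses are hypothetical names invented for illustration; only parse_start_url and parse_page correspond to names used with the real spider.

```python
# Toy model of the routing only; this is NOT Scrapy and makes no network calls.
# All class names and the dict-shaped "responses" are illustrative assumptions.


class ToyCrawlSpider:
    """Routes start responses and rule-followed responses to different methods."""

    def parse_start_url(self, response):
        # Default hook for start_urls responses: produces nothing.
        return []

    def handle_start_response(self, response):
        # Start pages go through parse_start_url, not through the rule callback.
        return list(self.parse_start_url(response))

    def handle_followed_response(self, response):
        # Pages reached by following a rule's links go to the rule callback.
        return list(self.parse_page(response))


class WithoutOverride(ToyCrawlSpider):
    """Mirrors the spider in the question: only the rule callback is defined."""

    def parse_page(self, response):
        for href in response["dataset_links"]:
            yield href


class WithOverride(WithoutOverride):
    """Same spider, but the first page is also sent to parse_page."""

    def parse_start_url(self, response):
        # Reuse the rule callback for the start page as well.
        return self.parse_page(response)


first_page = {"dataset_links": ["/dataset/a", "/dataset/b"]}
second_page = {"dataset_links": ["/dataset/c"]}

print(WithoutOverride().handle_start_response(first_page))
print(WithoutOverride().handle_followed_response(second_page))
print(WithOverride().handle_start_response(first_page))
```

If this is indeed the cause, the analogous change in the spider above would be to define parse_start_url so that it returns self.parse_page(response), which sends the first page through the same extraction as the paginated ones.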