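I'm new to Scrapy and am trying to crawl a website with CrawlSpider, following the "Next" button recursively. It isn't working. I think the problem is in the regular expression, but I've checked it many times and can't find the error. The spider only scrapes the landing page and never moves on to the next page.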
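# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from bedbugs.items import BedbugsItem  # module path assumed; not shown in the post

class BedbugsSpider(CrawlSpider):  # class and spider names assumed; not shown in the post
    name = 'bedbugs'
    start_urls = ['https://shopping.yahoo.com/merchantrating/?mid=13652']

    rules = (
        Rule(LinkExtractor(allow=r"/merchantrating/;_ylt=Anf3hF19R8MGFPwuYuJUny4cEb0F\?mid=13652&sort=1&start=\d+"), callback='parse_start_url', follow=True),
    )

    def parse_start_url(self, response):
        # Collect the text of every <p> element on the page.
        items = []
        for content in response.xpath('//p'):
            item = BedbugsItem()
            item['pageContent'] = content.xpath('text()').extract()
            items.append(item)
        return items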
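Answer 0 (score: 0)
The `;_ylt=...` segment hard-coded in your regex looks like a per-session token, so the pattern probably never matches the actual "Next" link. Instead of matching the URL, restrict link extraction to the "Next" button with an XPath: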
rules = (
    Rule(
        LinkExtractor(
            restrict_xpaths=[
                "//div[@class='pagination']//a[contains(., 'Next')]"
            ]
        ),
        callback='parse_start_url',
        follow=True,
    ),
)
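
To see why a URL regex can be fragile here, the following is a minimal sketch using only Python's `re` module. It assumes that the `_ylt` value differs between requests (not confirmed in the post), and the second URL and its token are made up for illustration. The `LOOSE` pattern is one possible relaxation, not something the question or answer proposed. As far as I know, `LinkExtractor` applies `allow` patterns with search semantics, which is what `matches` imitates.

```python
import re

# Pattern from the question: it hard-codes one specific `_ylt` token.
STRICT = r"/merchantrating/;_ylt=Anf3hF19R8MGFPwuYuJUny4cEb0F\?mid=13652&sort=1&start=\d+"

# Looser pattern (an assumption about the URL shape): accept any optional
# `;_ylt=<token>` segment instead of one fixed value.
LOOSE = r"/merchantrating/(?:;_ylt=[^?]*)?\?mid=13652&sort=1&start=\d+"

# Hypothetical "Next" links; the token in the second URL is invented.
same_token = ("https://shopping.yahoo.com/merchantrating/"
              ";_ylt=Anf3hF19R8MGFPwuYuJUny4cEb0F?mid=13652&sort=1&start=10")
other_token = ("https://shopping.yahoo.com/merchantrating/"
               ";_ylt=XXXXXXXXXXXX?mid=13652&sort=1&start=10")


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


print(matches(STRICT, same_token))   # link carries the exact token
print(matches(STRICT, other_token))  # link carries a different token
print(matches(LOOSE, other_token))   # looser pattern ignores the token value
```

If the token does change, the strict pattern stops matching and the rule extracts no links, which would explain why only the first page is scraped. The XPath rule above avoids the issue by not depending on the URL at all.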