LinkExtractor in Scrapy: pagination and two-depth links

Date: 2017-11-07 11:20:13

Tags: python scrapy

I am trying to understand how LinkExtractor works in Scrapy. What I want to accomplish:

  • Follow the pagination on the start page

  • Scan all links on those pages for URLs matching a pattern

  • On the pages behind the links found, follow further links matching the pattern and scrape those pages

My code:

import scrapy
from scrapy import Request
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ToScrapeMyspider(CrawlSpider):
    name            = "myspider"
    allowed_domains = ["myspider.com"]
    start_urls      = ["www.myspider.com/category.php?k=766"]
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//link[@rel="next"]/a'), follow=True),
        Rule(LinkExtractor(allow=r"/product.php?p=\d+$"), callback='parse_spider'),
    )

    def parse_spider(self, response):
        Request(allow=r"/product.php?e=\d+$",callback=self.parse_spider2)
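        # note: "allow" is not a scrapy.Request argument (it belongs to LinkExtractor),
        # and this Request is never yielded, so parse_spider2 is never reached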

    def parse_spider2(self, response):
        # EXTRACT AND PARSE DATA HERE ETC. (IS WORKING)
        pass

My pagination link looks like this:

<link rel="next" href="https://myspider.com/category.php?k=766&amp;s=100" >

First, I get an error from restrict_xpaths:

'str' object has no attribute 'iter'

But I guess I messed something up.
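
For context: the 'str' object has no attribute 'iter' error typically appears when the restrict_xpaths expression selects strings (text or attribute values) rather than elements. Separately, LinkExtractor only scans <a> and <area> tags by default, so //link[@rel="next"]/a matches nothing, since a <link> element has no <a> child. A rough, untested sketch of a rule that targets the <link> element itself (parameter values are illustrative, not from the original post):

Rule(
    LinkExtractor(
        restrict_xpaths='//link[@rel="next"]',  # select the element, not its text or attributes
        tags=('link',),                         # LinkExtractor defaults to ('a', 'area')
        attrs=('href',),                        # take URLs from the href attribute
    ),
    follow=True,
),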

1 answer:

Answer 0 (score: 1)

What finally worked:

rules = (
    Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@rel="next"]',)), follow=True),
    Rule(LinkExtractor(allow=(r'product\.php',)), callback='parse_spider'),
)


BASE_URL = 'https://myspider.com/'

def parse_spider(self, response):
    # collect the second-level product links and request each one explicitly
    links = response.xpath('//li[@id="id"]/a/@href').extract()
    for link in links:
        absolute_url = self.BASE_URL + link
        yield scrapy.Request(absolute_url, callback=self.parse_spider2)
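
As a side note, instead of concatenating a hand-maintained BASE_URL, Scrapy responses offer response.urljoin, which resolves relative links against the URL of the page being parsed. A minimal alternative sketch of the same callback (untested, reusing the XPath from the answer above):

def parse_spider(self, response):
    # urljoin handles absolute and relative hrefs alike,
    # so no BASE_URL bookkeeping is needed
    for link in response.xpath('//li[@id="id"]/a/@href').extract():
        yield scrapy.Request(response.urljoin(link), callback=self.parse_spider2)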