Question

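I want to crawl a website that has two levels of URLs. The first level is a multi-page list, with URLs like the following:

and a page layout like this:

List item link 1
List item link 2
List item link 3
List item link 4

1,2,3,4,5 ... nextpage

The second level is the detail page, with URLs like the following:

My spider code is: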

import scrapy
from scrapy.spiders.crawl import CrawlSpider
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders.crawl import Rule
from urlparse import urljoin

class MyCrawler(CrawlSpider):
    name = "AnjukeCrawler"

    start_urls=[
        "http://www.example.com/group/"
    ]

    rules = [
        Rule(LxmlLinkExtractor(allow=(),
                               restrict_xpaths=(["//div[@class='multi-page']/a[@class='aNxt']"])),
                               callback='parse_list_page',
                               follow=True)
    ]

    def parse_list_page(self, response):

        list_page=response.xpath("//div[@class='li-itemmod']/div/h3/a/@href").extract()

        for item in list_page:
            yield scrapy.http.Request(self,url=urljoin(response.url,item),callback=self.parse_detail_page)


    def parse_detail_page(self,response):

        community_name=response.xpath("//dl[@class='comm-l-detail float-l']/dd")[0].extract()

        self.log(community_name,2)

My problem is that parse_detail_page never seems to run. Can anyone tell me why, and how I can fix it?

Thanks!

Answer 1

If I understand your question correctly, what you are looking for is request chaining: carrying data collected from response1 over to response2 through the request.

Answer 2

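You should never override the parse method of CrawlSpider, because it contains the core parsing logic for this type of spider. So your def parse( should be def parse_list_page( and that typo is your problem.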
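Why overriding parse stops the rule callbacks from firing can be seen in a toy model. This is a simplified stand-in written in plain Python, not Scrapy's actual source; the class names ToyCrawlSpider, GoodSpider and BrokenSpider and the dictionary-shaped page are invented for illustration.

```python
class ToyCrawlSpider:
    """Simplified stand-in for a rule-driven spider base class (not Scrapy's code)."""

    rules = []  # list of (link_filter, callback_name) pairs

    def parse(self, page):
        # Core logic: match every link on the page against the rules
        # and dispatch it to the callback named by the matching rule.
        results = []
        for link in page["links"]:
            for link_filter, callback_name in self.rules:
                if link_filter(link):
                    results.append(getattr(self, callback_name)(link))
        return results


class GoodSpider(ToyCrawlSpider):
    rules = [(lambda link: "/detail/" in link, "parse_detail_page")]

    def parse_detail_page(self, link):
        return "detail:" + link


class BrokenSpider(GoodSpider):
    def parse(self, page):
        # Overriding parse replaces the rule dispatch in the base class,
        # so parse_detail_page is never called.
        return []


page = {"links": ["/group/p2", "/detail/1", "/detail/2"]}
print(GoodSpider().parse(page))    # ['detail:/detail/1', 'detail:/detail/2']
print(BrokenSpider().parse(page))  # []
```

The rules and the detail callback are identical in both spiders; only the overridden parse differs, and that alone is enough to silence the callback.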

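However, your rule looks like overkill, because it uses only a callback and follow=True to extract links. Consider using a list of rules instead and rewriting your spider like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor

class MyCrawler(CrawlSpider):
    name = "AnjukeCrawler"

    start_urls = [
        "http://www.example.com/group/"
    ]

    rules = [
        # Follow the "next page" link; no callback, just keep crawling.
        Rule(LxmlLinkExtractor(restrict_xpaths="//div[@class='multi-page']/a[@class='aNxt']"),
             follow=True),
        # Send every item link to parse_detail_page. restrict_xpaths must
        # select elements, not attributes, so the trailing /@href is dropped.
        Rule(LxmlLinkExtractor(restrict_xpaths="//div[@class='li-itemmod']/div/h3/a"),
             callback='parse_detail_page'),
    ]

    def parse_detail_page(self, response):
        community_name = response.xpath("//dl[@class='comm-l-detail float-l']/dd")[0].extract()
        self.log(community_name, 2)

BTW, there are too many brackets in your link extractor: restrict_xpaths=(["//div[@class='multi-page']/a[@class='aNxt']"])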
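The extra parentheses are harmless, but they also do nothing. A quick check in plain Python, no Scrapy needed; the variable names are only for illustration:

```python
xpath = "//div[@class='multi-page']/a[@class='aNxt']"

wrapped = ([xpath])      # what the question wrote: a list inside redundant parentheses
as_list = [xpath]        # the same value without the parentheses
not_a_tuple = (xpath)    # parentheses alone never create a tuple
real_tuple = (xpath,)    # the trailing comma does

print(wrapped == as_list)            # True
print(type(not_a_tuple).__name__)    # str
print(type(real_tuple).__name__)     # tuple
```

Since restrict_xpaths is documented as taking either a single XPath string or a list of them, passing the bare string says the same thing with less noise.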

How to crawl a two-level website with the Scrapy framework
