Scrapy CrawlSpider rules with multiple callbacks

Date: 2014-05-16 13:04:02

Tags: python scrapy

I want to create an ExampleSpider based on scrapy's CrawlSpider. My ExampleSpider should be able to handle pages containing only artist info, pages containing only album info, and some other pages that contain both artist and album info.

I was able to handle the first two scenarios, but the problem occurs in the third case. I am using a parse_artist(response) method to process artist data and a parse_album(response) method to process album data. My question is: if a page contains both artist and album data, how should I define my rules?

  1. Should I do it like below? (two rules for the same URL pattern)
  2. Should I use multiple callbacks? (does scrapy support multiple callbacks?)
  3. Is there another way? (a proper way)

    # Imports for Scrapy ~0.22/0.24 (the SgmlLinkExtractor era)
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    from example.items import ArtistItem, AlbumItem  # assumed project items module

    class ExampleSpider(CrawlSpider):
        name = 'example'

        start_urls = ['http://www.example.com']

        rules = [
            Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_artist', follow=True),
            Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_album', follow=True),
            # more rules .....
        ]

        def parse_artist(self, response):
            artist_item = ArtistItem()
            try:
                # do the scrape and assign to ArtistItem
                pass
            except Exception:
                # ignore for now
                pass
            return artist_item

        def parse_album(self, response):
            album_item = AlbumItem()
            try:
                # do the scrape and assign to AlbumItem
                pass
            except Exception:
                # ignore for now
                pass
            return album_item
    

1 Answer:

Answer 0 (score: 8):

CrawlSpider calls the _requests_to_follow() method to extract URLs and generate the requests to follow:

def _requests_to_follow(self, response):
    if not isinstance(response, HtmlResponse):
        return
    seen = set()
    for n, rule in enumerate(self._rules):
        # links already claimed by an earlier rule are filtered out here
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
        if links and rule.process_links:
            links = rule.process_links(links)
        seen = seen.union(links)
        for link in links:
            # each link produces exactly one request, bound to this rule's callback
            r = Request(url=link.url, callback=self._response_downloaded)
            r.meta.update(rule=n, link_text=link.text)
            yield rule.process_request(r)

As you can see:

  • the variable seen stores the links that have already been processed
  • each URL is handled by at most one callback, as the sketch below illustrates
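
So option 1 from the question (two rules with the same URL pattern) won't work. A minimal sketch of why, using a hypothetical /detail/\d+ pattern:

rules = [
    # The first matching rule claims the link and its callback fires.
    Rule(SgmlLinkExtractor(allow=[r'/detail/\d+']), callback='parse_artist', follow=True),
    # This second rule never fires for those links: _requests_to_follow()
    # already added them to `seen` while processing the rule above.
    Rule(SgmlLinkExtractor(allow=[r'/detail/\d+']), callback='parse_album', follow=True),
]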

You can define a single parse_item() callback that calls both parse_artist() and parse_album():

rules = [
    Rule(SgmlLinkExtractor(allow=[r'same regex_rule']), callback='parse_item', follow=True),
    # more rules .....
]

def parse_item(self, response):
    yield self.parse_artist(response)
    yield self.parse_album(response)
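
Since some pages contain only artist data or only album data, you may also want parse_item() to check what the page actually holds before delegating. A minimal sketch, assuming hypothetical div[@class="artist"] and div[@class="album"] markers (adapt the XPath to the real markup):

from scrapy.selector import Selector

def parse_item(self, response):
    sel = Selector(response)
    # Placeholder XPath expressions -- replace with whatever actually
    # distinguishes artist pages from album pages on the target site.
    if sel.xpath('//div[@class="artist"]'):
        yield self.parse_artist(response)
    if sel.xpath('//div[@class="album"]'):
        yield self.parse_album(response)

This way a mixed page yields both items, while an artist-only or album-only page yields just one.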