How can I make CrawlSpider rules context-sensitive?

Date: 2014-03-26 06:56:59

Tags: scrapy

I've noticed that CrawlSpider rules extract URLs from every page that gets crawled.
Can I enable a rule only when the current page meets some condition (for example, its URL matches a regular expression)?

I have two pages:


-------------------Page A-------------------
Page URL: http://www.site.com/pattern-match.html
--------------------------------------------

- [link](http://should-extract-this)
- [link](http://should-extract-this)
- [link](http://should-extract-this)

--------------------------------------------

--------------------Page B--------------------
Page URL: http://www.site.com/pattern-not-match.html
-----------------------------------------------

- [link](http://should-not-extract-this)
- [link](http://should-not-extract-this)
- [link](http://should-not-extract-this)

-----------------------------------------------

So the rule should extract URLs only from Page A. How can I do that? Thanks!
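
Something like this sketch (placeholder names, old-style scrapy.contrib imports) is roughly what I have; as far as I can tell, a link extractor's `allow` regex only filters the URLs of the extracted links, not the URL of the page they were found on, so it doesn't help here:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class SiteSpider(CrawlSpider):

    name = 'site'
    allowed_domains = ['site.com']
    start_urls = ['http://www.site.com']

    rules = [
        # an `allow=` pattern here is matched against the extracted link URLs,
        # not against the URL of the page they came from
        Rule(SgmlLinkExtractor(restrict_xpaths='//a'), callback='parse_item'),
    ]

    def parse_item(self, response):
        print response.url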

1 Answer:

Answer 0 (score: 1)

I just found a dirty way to inject the response into the rule.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from scrapy.http import Request, HtmlResponse
from scrapy.contrib.spiders import CrawlSpider, Rule

import inspect

class MyCrawlSpider(CrawlSpider):

    def _requests_to_follow(self, response):
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            seen = seen.union(links)
            for link in links:
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)

                # ***>>> HACK <<<***
                # pass `response` as additional argument to `process_request`

                fun = rule.process_request
                if not hasattr(fun, 'nargs'):
                    # cache how many positional arguments the callable takes
                    fun.nargs = len(inspect.getargs(fun.func_code).args)
                if fun.nargs == 1:
                    # stock Scrapy signature: process_request(request)
                    yield fun(r)
                elif fun.nargs == 2:
                    # extended signature: process_request(request, response)
                    yield fun(r, response)
                else:
                    raise Exception('too many arguments')

Try it out:

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

def process_request(request, response):
    # keep the request only if the page it was extracted from matches;
    # returning None drops the request
    if 'magick' in response.url:
        return request

class TestSpider(MyCrawlSpider):

    name = 'test'
    allowed_domains = ['test.com']
    start_urls = ['http://www.test.com']

    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths='//a'), callback='parse_item', process_request=process_request),
    ]

    def parse_item(self, response):
        print response.url
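
For what it's worth, recent Scrapy versions (check the docs for the version you run) pass the originating response to a rule's process_request directly, so a similar filter can be written without overriding _requests_to_follow. A rough sketch under that assumption, with placeholder names and the newer import paths:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

def filter_by_source_page(request, response):
    # `response` is the page the link was extracted from;
    # returning None drops the request
    if 'pattern-match' in response.url:
        return request
    return None

class ModernTestSpider(CrawlSpider):

    name = 'modern_test'
    allowed_domains = ['test.com']
    start_urls = ['http://www.test.com']

    rules = [
        Rule(LinkExtractor(restrict_xpaths='//a'),
             callback='parse_item',
             process_request=filter_by_source_page),
    ]

    def parse_item(self, response):
        print(response.url)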