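Creating editable CrawlSpider rules in Scrapy

Time: 2013-03-22 19:03:43

Tags: python web-crawler scrapy

I've been trying to write a simple Scrapy CrawlSpider script that can be changed easily, but I can't figure out how to get the link extractor rules to work properly.

Here is my code: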

class LernaSpider(CrawlSpider):
    """Our ad-hoc spider"""

    name = "lerna"

    def __init__(self, url, allow_follow='.*', deny_follow='', allow_extraction='.*', deny_extraction=''):
        parsed_url = urlparse(url)
        domain = str(parsed_url.netloc)
        self.allowed_domains = [domain]
        self.start_urls = [url]
        self.rules = (
            # Extract links
            # and follow links from them (since no callback means follow=True by default).
            Rule(SgmlLinkExtractor(allow=(allow_follow, ), deny=(deny_follow, ))),

            # Extract links and parse them with the spider's method parse_item
            Rule(SgmlLinkExtractor(allow=(allow_extraction, ), deny=(deny_extraction, )), callback='parse_item'),
        )

        super(LernaSpider, self).__init__()

    def parse_item(self, response):
        print 'Crawling... %s' % response.url
        # more stuff here
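I have this code, but I can never get the allow/deny rules to work properly, and I really don't understand why. Does leaving an empty string cause it to deny everything? I assumed that, since it's a regular expression, it would only do a blanket deny if I put in '.*' or something similar.

Any help would be greatly appreciated.

1 answer:

Answer 0: (score: 3)

Are you instantiating the spider yourself? Something like: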

spider = LernaSpider('http://example.com')

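Because otherwise, if you run it from the command line with $ scrapy crawl lerna, you are wrongly using url as the first argument of the constructor (it should be name), and you are also not passing it on to super. Maybe try this: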

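from urlparse import urlparse  # Python 2 standard library

# Import paths as used by Scrapy releases of that era.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class LernaSpider(CrawlSpider):
    """Our ad-hoc spider"""

    name = "lerna"

    def __init__(self, name=None, url=None, allow_follow='.*', deny_follow='',
                 allow_extraction='.*', deny_extraction='', **kw):
        # url has no sensible default, so fail early when it is missing.
        # From the command line: scrapy crawl lerna -a url=http://example.com
        if url is None:
            raise ValueError("a start url is required")
        parsed_url = urlparse(url)
        domain = str(parsed_url.netloc)
        self.allowed_domains = [domain]
        self.start_urls = [url]
        # The rules must be set before calling the parent constructor,
        # because CrawlSpider compiles them there.
        self.rules = (
            # Extract links
            # and follow links from them (since no callback means follow=True by default).
            Rule(SgmlLinkExtractor(allow=allow_follow, deny=deny_follow)),

            # Extract links and parse them with the spider's method parse_item
            Rule(SgmlLinkExtractor(allow=allow_extraction, deny=deny_extraction), callback='parse_item'),
        )
        # Pass name and any remaining keyword arguments on to CrawlSpider.
        super(LernaSpider, self).__init__(name, **kw)

    def parse_item(self, response):
        print 'Crawling... %s' % response.url
        # more stuff here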
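
The domain handling in the constructor can be tried on its own, outside Scrapy. This is only a minimal sketch: domain_of is a hypothetical helper introduced here for illustration, and it shows nothing more than what urlparse(...).netloc returns, which is the value that ends up in allowed_domains.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def domain_of(url):
    """Return the network location (host, plus port if present) of a URL."""
    return urlparse(url).netloc


# A URL with a scheme yields the host name.
with_scheme = domain_of("http://example.com/some/page")

# Without a scheme, urlparse treats the whole string as a path,
# so the network location comes back empty.
without_scheme = domain_of("example.com/some/page")
```

So if the url argument is passed without http://, allowed_domains would contain an empty string, which may be worth ruling out while debugging.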

The regex stuff looks fine: empty values allow everything and deny nothing.
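
One thing that may be worth checking separately is the difference between passing no deny patterns at all and passing an empty string as a pattern. The sketch below is a simplified model that uses only Python's re module. link_allowed is a hypothetical helper, not Scrapy's implementation, so treat it as an assumption about how an allow/deny filter of this kind could behave.

```python
import re


def link_allowed(url, allow=(), deny=()):
    """Simplified allow/deny filter (illustrative only, not Scrapy's code)."""
    allow_res = [re.compile(p) for p in allow]
    deny_res = [re.compile(p) for p in deny]
    # With no allow patterns, every URL passes the allow check.
    if allow_res and not any(r.search(url) for r in allow_res):
        return False
    # With no deny patterns, nothing is rejected.
    if deny_res and any(r.search(url) for r in deny_res):
        return False
    return True


url = "http://example.com/page/1"

# An empty tuple means "no deny patterns".
no_deny = link_allowed(url, allow=(".*",), deny=())

# An empty string is a valid regex that matches at position 0 of any
# string, so in this model it rejects every URL.
empty_string_deny = link_allowed(url, allow=(".*",), deny=("",))

# A specific pattern only rejects the URLs it matches.
specific_deny = link_allowed(url, allow=(".*",), deny=(r"/private/",))
```

In this model the blanket deny comes from the empty string, not from '.*' alone, so if a spider rejects every link it may help to pass an empty tuple (or omit the argument) when no deny pattern is wanted, and then compare the results.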