I am trying this:
class SpiderSpider(CrawlSpider):
    name = "lolies"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/directory/lol2']
    rules = (
        Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\w+$']), follow=True),
        Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\w+/\d+$']), follow=True),
        Rule(SgmlLinkExtractor(allow=[r'directory/lol2/\d+$']), callback=self.parse_loly),
    )

    def parse_loly(self, response):
        print 'Hi this is the loly page %s' % response.url
        return
That gives me back:
NameError: name 'self' is not defined
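(For context, this error happens because a class body is executed top to bottom as ordinary code, and `self` only exists inside method bodies, never at class-attribute level. A minimal reproduction, using a hypothetical `Broken` class unrelated to Scrapy:)

```python
# 'self' is only defined inside methods; a class body has no 'self',
# so building a class attribute from self.<anything> raises NameError
# at class-definition time.
try:
    class Broken:
        def handler(self):
            return "handled"

        # executed while the class is being defined -- no 'self' in scope
        config = {"callback": self.handler}
except NameError as e:
    print(e)  # name 'self' is not defined
```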
If I change the callback to callback="self.parse_loly",
it never seems to get called and never prints the URL.
The site does seem to be crawled without problems, though, since I get many "Crawled 200" debug messages for that rule.
What might I be doing wrong?
Thanks in advance!
Answer (score: 1)
It looks like the indentation of parse_loly is off. Python is whitespace-sensitive, so to the interpreter parse_loly looks like a function defined outside SpiderSpider rather than a method of it.
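(The indentation point can be demonstrated with a small, hypothetical example: a dedented `def` right after a class body becomes a module-level function, not a method of the class.)

```python
class Outer:
    def method(self):
        return "inside"

# dedented: this is a module-level function, NOT a method of Outer
def not_a_method():
    return "outside"

print(hasattr(Outer, "method"))        # True
print(hasattr(Outer, "not_a_method"))  # False
```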
You may also want to split the rules line into shorter lines, per PEP 8.
Try this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class SpiderSpider(CrawlSpider):
    name = "lolies"
    allowed_domains = ["domain.com"]
    start_urls = ['http://www.domain.com/directory/lol2/']

    rules = (
        Rule(SgmlLinkExtractor(allow=(r'\w+$', ))),
        Rule(SgmlLinkExtractor(allow=(r'\w+/\d+$', ))),
        Rule(SgmlLinkExtractor(allow=(r'\d+$', )), callback='parse_loly'),
    )

    def parse_loly(self, response):
        print 'Hi this is the loly page %s' % response.url
        return None
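(Note that the callback is now the string 'parse_loly', not a method reference. A rough sketch of why a string works, using a hypothetical MiniSpider class rather than Scrapy's actual internals: the framework can look the name up on the spider instance at runtime with getattr, by which point bound methods exist.)

```python
# Hypothetical sketch: resolving a string callback name to a bound
# method at runtime, roughly why callback='parse_loly' works while
# self.parse_loly cannot exist during class-body execution.
class MiniSpider:
    rules = ({"callback": "parse_loly"},)

    def parse_loly(self, response_url):
        return "Hi this is the loly page %s" % response_url

spider = MiniSpider()
rule = spider.rules[0]
callback = getattr(spider, rule["callback"])  # name -> bound method
print(callback("http://www.domain.com/directory/lol2/42"))
# Hi this is the loly page http://www.domain.com/directory/lol2/42
```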