I have a spider up and running:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):
    name = "spidermaster"
    allowed_domains = ["www.test.com"]
    start_urls = ["http://www.test.com/"]
    rules = [Rule(SgmlLinkExtractor(allow=()), follow=True),
             Rule(SgmlLinkExtractor(allow=()), callback='parse_item'),
             ]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
I'm trying to crawl the whole site except for content under certain paths.
For example, I want to crawl everything on the test site except www.test.com/too_much_links.
Thanks in advance
Answer 0 (score: 0)
I usually do it this way:
ignore = ['too_much_links', 'many_links']

rules = [Rule(SgmlLinkExtractor(allow=(), deny=ignore), follow=True),
         Rule(SgmlLinkExtractor(allow=(), deny=ignore), callback='parse_item'),
         ]
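As a sanity check on your deny patterns before a full crawl, here is a minimal sketch (plain `re`, no Scrapy dependency, with hypothetical URLs) of how the extractor's deny filter behaves: a link is dropped if any deny pattern matches anywhere in its URL.

```python
import re

# Same deny list as in the rules above.
ignore = ['too_much_links', 'many_links']

def is_allowed(url, deny=ignore):
    """Mimic the deny filter: reject a URL if any deny
    pattern matches anywhere in it."""
    return not any(re.search(pattern, url) for pattern in deny)

urls = [
    "http://www.test.com/",
    "http://www.test.com/too_much_links/page1",
    "http://www.test.com/articles/42",
]
crawlable = [u for u in urls if is_allowed(u)]
# crawlable keeps only the URLs outside the denied paths
```

Note that the deny entries are treated as regular expressions, so `'too_much_links'` also blocks any URL containing that substring, e.g. `/foo/too_much_links_v2`; anchor the pattern (e.g. `r'/too_much_links(/|$)'`) if you need a stricter match.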