Hi, I am trying to use CrawlSpider, and I created my own deny rules:
class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["careers-cooperhealth.icims.com"]
    start_urls = ["careers-cooperhealth.icims.com"]
    d = [0-9]
    path_deny_base = ['.(login)', '.(intro)', '(candidate)', '(referral)', '(reminder)', '(/search)']
    rules = (Rule(SgmlLinkExtractor(deny=path_deny_base,
                                    allow=('careers-cooperhealth.icims.com/jobs/…*')),
                  callback="parse_items",
                  follow=True),)
Still, my spider crawled pages like https://careers-cooperhealth.icims.com/jobs/22660/registered-nurse-prn/login — the login page should not be crawled. What is wrong here?
Answer 0 (score: 2):
Just change the patterns this way (no dots and no parentheses):
deny = ['login', 'intro', 'candidate', 'referral', 'reminder', 'search']
allow = ['jobs']
rules = (Rule(SgmlLinkExtractor(deny=deny,
                                allow=allow,
                                restrict_xpaths=('*')),
              callback="parse_items",
              follow=True),)
This means that no extracted link contains login, intro, etc., and only links that contain jobs are extracted.
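
(A side note not in the original answer: allow and deny are applied as regular expressions to the full URL via re.search, and a link survives only if it matches at least one allow pattern and none of the deny patterns. Below is a minimal sketch of that filtering logic, using a hypothetical helper link_allowed — an illustration, not Scrapy's actual code:)

import re

def link_allowed(url, allow, deny):
    # Keep the link only if it matches at least one allow pattern...
    if allow and not any(re.search(p, url) for p in allow):
        return False
    # ...and matches none of the deny patterns.
    if any(re.search(p, url) for p in deny):
        return False
    return True

deny = ['login', 'intro', 'candidate', 'referral', 'reminder', 'search']
allow = ['jobs']

url = "https://careers-cooperhealth.icims.com/jobs/22660/registered-nurse-prn/login"
print(link_allowed(url, allow, deny))  # False -- 'login' matches a deny pattern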
Here is the whole spider code that crawls the link https://careers-cooperhealth.icims.com/jobs/intro?hashed=0 and prints "YAHOO!":
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    name = "craigs"
    allowed_domains = ["careers-cooperhealth.icims.com"]
    start_urls = ["https://careers-cooperhealth.icims.com"]

    # Deny any URL containing these words; allow only URLs containing 'jobs'.
    deny = ['login', 'intro', 'candidate', 'referral', 'reminder', 'search']
    allow = ['jobs']

    rules = (Rule(SgmlLinkExtractor(deny=deny,
                                    allow=allow,
                                    restrict_xpaths=('*')),
                  callback="parse_items",
                  follow=True),)

    def parse_items(self, response):
        print "YAHOO!"
Hope that helps.