For the purposes of the application I am working on, I need the crawl to break out of its current scope and start crawling again from a specific, arbitrary URL.
The expected behavior is that, when a certain condition is met, the crawl goes back to a specific URL, which can be supplied as a parameter.
I am using a CrawlSpider, but I don't know how to achieve this:
    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class MyCrawlSpider(CrawlSpider):
        name = 'mycrawlspider'
        initial_url = ""

        def __init__(self, initial_url, *args, **kwargs):
            self.initial_url = initial_url
            domain = "mydomain.com"
            self.start_urls = [initial_url]
            self.allowed_domains = [domain]
            self.rules = (
                Rule(LinkExtractor(allow=[r"^http[s]?://(www\.)?" + domain + "/.*"]),
                     callback='parse_item', follow=True),
            )
            # let CrawlSpider's own __init__ compile the rules set above
            super(MyCrawlSpider, self).__init__(*args, **kwargs)

        def parse_item(self, response):
            if some_condition is True:  # placeholder for the restart condition
                # force scrapy to go back to home page and recrawl
                print("Should break out")
            else:
                print("Just carry on")
I tried placing

    return scrapy.Request(self.initial_url, callback=self.parse_item)

in the branch where some_condition is True, but without success. Any help is greatly appreciated; I have been struggling with this for a while.
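One likely reason the re-issued request never fires: Scrapy's scheduler drops any request whose fingerprint it has already seen, so a Request back to a URL that was already crawled is silently discarded unless you pass dont_filter=True. The toy filter below is plain Python, not Scrapy's actual RFPDupeFilter, and the URL is made up; it only sketches that dedup behavior:

```python
class ToyDupeFilter:
    """Minimal stand-in for Scrapy's duplicate filter: remembers every
    URL it has let through and drops repeats unless dont_filter is set."""

    def __init__(self):
        self.seen = set()

    def should_drop(self, url, dont_filter=False):
        if dont_filter:
            return False  # bypassed, like Request(..., dont_filter=True)
        if url in self.seen:
            return True   # duplicate: silently dropped
        self.seen.add(url)
        return False

f = ToyDupeFilter()
print(f.should_drop("http://mydomain.com/"))                    # False: first visit
print(f.should_drop("http://mydomain.com/"))                    # True: duplicate
print(f.should_drop("http://mydomain.com/", dont_filter=True))  # False: bypassed
```

So before anything more drastic, it is worth returning the request with dont_filter=True from parse_item and seeing whether the crawl does revisit the initial URL.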
Answer (score: 0)
You could define a custom exception and handle it appropriately, something like this...
Feel free to edit this with the correct CrawlSpider syntax:
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class RestartException(Exception):
        pass

    class MyCrawlSpider(CrawlSpider):
        name = 'mycrawlspider'
        initial_url = ""

        def __init__(self, initial_url, *args, **kwargs):
            self.initial_url = initial_url
            domain = "mydomain.com"
            self.start_urls = [initial_url]
            self.allowed_domains = [domain]
            self.rules = (
                Rule(LinkExtractor(allow=[r"^http[s]?://(www\.)?" + domain + "/.*"]),
                     callback='parse_item', follow=True),
            )
            # let CrawlSpider's own __init__ compile the rules set above
            super(MyCrawlSpider, self).__init__(*args, **kwargs)

        def parse_item(self, response):
            if some_condition is True:  # placeholder for the restart condition
                print("Should break out")
                raise RestartException("We're restarting now")
            else:
                print("Just carry on")
    siteName = "http://whatever.com"
    crawler = MyCrawlSpider(siteName)

    while True:
        try:
            # idk how you start this thing, but do that
            crawler.run()
            break
        except RestartException as err:
            print(err.args)
            crawler.something = err.args
            continue

    print("I'm done!")
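The restart loop above can be sketched in plain Python without Scrapy; run_crawl, the page list, and the "bad" trigger are all made-up placeholders. One caveat worth knowing: in a real Scrapy run the callbacks execute inside Twisted's reactor, which logs exceptions from callbacks rather than letting them propagate out of the crawl, and the reactor cannot be restarted within the same process, so this pattern is easier to demonstrate than to wire into an actual CrawlSpider:

```python
class RestartException(Exception):
    pass

def run_crawl(pages, max_restarts=2):
    """Process pages from the top; a RestartException sends us back to
    the start, up to max_restarts times. Returns the restart count."""
    restarts = 0
    while True:
        try:
            for page in pages:
                # hypothetical restart condition: we hit a "bad" page
                if page == "bad" and restarts < max_restarts:
                    raise RestartException("We're restarting now")
                print("processing", page)
            return restarts
        except RestartException as err:
            print(err.args)
            restarts += 1
            continue

print(run_crawl(["home", "bad", "about"]))  # restarts twice, then finishes: 2
```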