The problem I'm running into is that my spider gets stuck in forums: it can keep crawling for days without finding anything at all. Is there a way to limit crawling of a given site (based on its start_url) to a certain amount of time? Any other solution to this problem?
Answer 0 (score: 0)
What I ended up doing was using process_links and writing a method that checks the time:
from urllib.parse import urlparse
import datetime

rules = (
    Rule(LinkExtractor(allow=()), callback='parse_obj', follow=True,
         process_links='check_for_semi_dupe'),
)

# Method to try avoiding spider traps and endless loops
def check_for_semi_dupe(self, links):
    for link in links:
        # Strip the scheme and "www." prefix to get the bare domain
        domainparts = urlparse(link.url)
        just_domain = domainparts[1].replace("www.", "")
        url_indexed = 0
        # Record the first time this domain is seen; otherwise mark it as already known
        if just_domain not in self.processed_dupes:
            self.processed_dupes[just_domain] = datetime.datetime.now()
        else:
            url_indexed = 1
        # Seconds elapsed since the domain was first seen
        timediff_in_sec = int((datetime.datetime.now() - self.processed_dupes[just_domain]).total_seconds())
        if just_domain in self.blocked:
            print("*** Domain '%s' was blocked! ***" % just_domain)
            print("*** Link was: %s" % link.url)
            continue
        elif url_indexed == 1 and timediff_in_sec > (self.time_threshhold * 60):
            # Time budget for this domain is used up: block it from now on
            self.blocked.append(just_domain)
            continue
        else:
            yield link
The method records the datetime when a domain first shows up for crawling. A class variable "time_threshhold" defines the desired crawl time per domain in minutes. When the spider feeds links in for crawling, the method decides whether each link should be passed on for crawling or blocked.
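The snippet assumes that processed_dupes, blocked, and time_threshhold already exist on the spider, but their setup is not shown. Below is a minimal sketch of how they might be initialized on a CrawlSpider; the spider name, start URL, threshold value, and the parse_obj stub are assumptions for illustration, not part of the original answer.

# Minimal sketch (assumption): spider state used by check_for_semi_dupe
import datetime

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ForumSpider(CrawlSpider):          # hypothetical spider name
    name = 'forum_spider'                # hypothetical
    start_urls = ['http://example.com']  # placeholder start URL

    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_obj', follow=True,
             process_links='check_for_semi_dupe'),
    )

    # Maximum number of minutes to keep following links from one domain
    time_threshhold = 30                 # assumed value

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.processed_dupes = {}   # domain -> datetime it was first seen
        self.blocked = []           # domains whose time budget has run out

    # check_for_semi_dupe from the answer above goes here as a method

    def parse_obj(self, response):
        # Placeholder callback; the original answer does not show it
        pass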