Scrapy: can you limit crawl time at the domain level?

Time: 2015-02-20 08:14:45

Tags: python web-crawler scrapy

The problem I'm running into is that my spider gets stuck in forums, where it can keep crawling for days without finding anything at all. Is there a way to limit the crawl of a given site (based on its start_url) to a certain amount of time? Is there any other solution to this problem?

1 Answer:

Answer 0 (score: 0)

What I ended up doing was using process_links and writing a method that checks the time:

import datetime
from urllib.parse import urlparse

from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

rules = (Rule(LinkExtractor(allow=()), callback='parse_obj', follow=True,
              process_links='check_for_semi_dupe'),)

# Method to try avoiding spider traps and endless loops
def check_for_semi_dupe(self, links):
    for link in links:
        # Reduce every URL to its bare domain so all pages of a site share one timer
        domainparts = urlparse(link.url)
        just_domain = domainparts.netloc.replace("www.", "")
        url_indexed = 0
        if just_domain not in self.processed_dupes:
            # First time this domain is seen: record when crawling of it started
            self.processed_dupes[just_domain] = datetime.datetime.now()
        else:
            # Domain already seen: work out how long we have been crawling it
            url_indexed = 1
            timediff_in_sec = int(
                (datetime.datetime.now() - self.processed_dupes[just_domain]).total_seconds())
        if just_domain in self.blocked:
            # Domain already exceeded its time budget: drop the link
            print("*** Domain '%s' was blocked! ***" % just_domain)
            print("*** Link was: %s" % link.url)
            continue
        elif url_indexed == 1 and timediff_in_sec > (self.time_threshhold * 60):
            # Time budget (time_threshhold, in minutes) exceeded: block the domain
            self.blocked.append(just_domain)
            continue
        else:
            yield link

The method records the datetime at which a domain is first seen during the crawl. The class variable time_threshhold defines the desired crawl time per domain in minutes. As the spider is fed links to crawl, the method decides whether each link should be passed on for crawling or blocked.
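For completeness, here is a minimal sketch of how these pieces might sit inside a CrawlSpider. The spider name, start URL, and the 30-minute budget are placeholder assumptions; processed_dupes, blocked and time_threshhold are the attributes the method above relies on, and check_for_semi_dupe appears only as a pass-through stub standing in for the full method shown earlier:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ForumSpider(CrawlSpider):
    name = 'forum_spider'                       # hypothetical spider name
    start_urls = ['http://example.com/forum/']  # placeholder start URL

    # State used by check_for_semi_dupe
    processed_dupes = {}   # domain -> datetime when the domain was first seen
    blocked = []           # domains whose crawl-time budget has run out
    time_threshhold = 30   # assumed per-domain budget, in minutes

    rules = (Rule(LinkExtractor(allow=()), callback='parse_obj', follow=True,
                  process_links='check_for_semi_dupe'),)

    def check_for_semi_dupe(self, links):
        # Stub: the real body is the method shown above; here links pass through unchanged
        for link in links:
            yield link

    def parse_obj(self, response):
        # Normal item extraction would go here
        yield {'url': response.url}

The values above are only illustrative; set time_threshhold to whatever per-domain crawl time you actually want.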