The problem I'm running into is that my spider gets stuck in forums: it can keep crawling for days without finding anything at all. Is there a way to limit crawling of a given site (based on its start_url) to a certain amount of time? Any other solution to this problem?
Answer 0 (score: 0)
What I ended up doing was using process_links and writing a method that checks the time:
from urllib.parse import urlparse
import datetime

rules = (
    Rule(LinkExtractor(allow=()), callback='parse_obj', follow=True,
         process_links='check_for_semi_dupe'),
)

# Method to try avoiding spider traps and endless loops
def check_for_semi_dupe(self, links):
    for link in links:
        # Strip the scheme and "www." prefix to get the bare domain
        domainparts = urlparse(link.url)
        just_domain = domainparts[1].replace("www.", "")
        url_indexed = 0
        # Record the first time this domain is seen; otherwise mark it as already known
        if just_domain not in self.processed_dupes:
            self.processed_dupes[just_domain] = datetime.datetime.now()
        else:
            url_indexed = 1
        # Seconds elapsed since the domain was first seen
        timediff_in_sec = int((datetime.datetime.now() - self.processed_dupes[just_domain]).total_seconds())
        if just_domain in self.blocked:
            print("*** Domain '%s' was blocked! ***" % just_domain)
            print("*** Link was: %s" % link.url)
            continue
        elif url_indexed == 1 and timediff_in_sec > (self.time_threshhold * 60):
            # Time budget for this domain is used up: block it from now on
            self.blocked.append(just_domain)
            continue
        else:
            yield link
The method records the datetime when a domain first shows up for crawling. A class variable "time_threshhold" defines the desired crawl time per domain in minutes. When the spider feeds links in for crawling, the method decides whether each link should be passed on for crawling or blocked.
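The snippet assumes that processed_dupes, blocked, and time_threshhold already exist on the spider, but their setup is not shown. Below is a minimal sketch of how they might be initialized on a CrawlSpider; the spider name, start URL, threshold value, and the parse_obj stub are assumptions for illustration, not part of the original answer.

# Minimal sketch (assumption): spider state used by check_for_semi_dupe
import datetime

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ForumSpider(CrawlSpider):          # hypothetical spider name
    name = 'forum_spider'                # hypothetical
    start_urls = ['http://example.com']  # placeholder start URL

    rules = (
        Rule(LinkExtractor(allow=()), callback='parse_obj', follow=True,
             process_links='check_for_semi_dupe'),
    )

    # Maximum number of minutes to keep following links from one domain
    time_threshhold = 30                 # assumed value

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.processed_dupes = {}   # domain -> datetime it was first seen
        self.blocked = []           # domains whose time budget has run out

    # check_for_semi_dupe from the answer above goes here as a method

    def parse_obj(self, response):
        # Placeholder callback; the original answer does not show it
        pass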