Question

Imagine that some scrapers are crawling my website. How can I ban them while still whitelisting Google's bots?

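I think I can find the IP ranges of Google's bots, and I am considering using Redis to store all of the day's accesses: if I see too many requests from the same IP within a short time -> ban.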
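A minimal sketch of that counting idea, written as a fixed-window counter per IP. To keep it self-contained, an in-memory `Map` stands in for Redis (an assumption; with Redis the same logic would roughly correspond to `INCR` plus `EXPIRE` on a per-IP key). The names `createRateLimiter`, `rateLimitMiddleware`, `limit`, `windowMs` and `whitelist` are hypothetical and not part of any library.

```javascript
// Fixed-window request counter per IP.
// The Map is a stand-in for Redis (assumption made for this sketch).
function createRateLimiter({ limit, windowMs, now = Date.now }) {
  const hits = new Map(); // ip -> { count, windowStart }

  // Records one request and returns true once the IP exceeds `limit`
  // requests inside the current window.
  return function hit(ip) {
    const t = now();
    const entry = hits.get(ip);
    if (!entry || t - entry.windowStart >= windowMs) {
      hits.set(ip, { count: 1, windowStart: t }); // start a new window
      return false;
    }
    entry.count += 1;
    return entry.count > limit;
  };
}

// Express-style middleware (hypothetical wiring).
// `whitelist` is a Set of IPs that are never counted, e.g. verified Google bots.
function rateLimitMiddleware(hit, whitelist = new Set()) {
  return (req, res, next) => {
    if (!whitelist.has(req.ip) && hit(req.ip)) {
      res.statusCode = 403;
      return res.end('Forbidden');
    }
    next();
  };
}
```

With Express this could be mounted as `app.use(rateLimitMiddleware(createRateLimiter({ limit: 100, windowMs: 60000 })))`; the numbers are placeholders to tune, not recommendations.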

My stack is Ubuntu Server, Node.js, and Express.js.

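The main problem I see is that this detection sits behind Varnish, so the Varnish cache would have to be disabled. Is there a better idea, or is this a good one?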

Answer 1

You can use a Varnish ACL [1]. It may be a bit harder to maintain than the equivalent in Apache, but it will certainly work:

acl bad_boys {
  "192.0.2.0"/24; // Your evil range (placeholder from a documentation range)
  "198.51.100.42"; // Another evil IP (placeholder)
}

// ...

sub vcl_recv {
  if (client.ip ~ bad_boys) {
    error 403 "Forbidden";
  }
  // ...
}

// ...

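You can also use a whitelist, the user agent, or other techniques to make sure the client is not Googlebot ... but I would defend myself in Varnish rather than in Apache.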
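Since a user agent string is easy to spoof, one of the "other techniques" could be a reverse DNS lookup followed by a forward confirmation. The sketch below assumes that genuine Googlebot hostnames end in `googlebot.com` or `google.com`; check Google's current documentation before relying on that list. `isGooglebot` and the injectable `resolver` parameter are hypothetical names; the `reverse` and `lookup` calls are from Node's built-in `dns.promises` API.

```javascript
const dns = require('dns').promises;

// Reverse-then-forward DNS check for a client IP.
// The hostname suffixes are an assumption for this sketch.
// `resolver` is injectable so the function can be exercised offline.
async function isGooglebot(ip, resolver = dns) {
  let hostnames;
  try {
    hostnames = await resolver.reverse(ip); // PTR lookup
  } catch {
    return false; // no PTR record: treat as not verified
  }
  for (const host of hostnames) {
    if (!/\.(googlebot|google)\.com$/i.test(host)) continue;
    try {
      // Forward-confirm: the hostname must resolve back to the same IP.
      const addresses = await resolver.lookup(host, { all: true });
      if (addresses.some((a) => a.address === ip)) return true;
    } catch {
      // Forward lookup failed: try the next hostname.
    }
  }
  return false;
}
```

DNS lookups are slow compared with an ACL match, so the result would normally be cached per IP (for example in the same store used for counting requests) rather than computed on every request.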

[1] https://www.varnish-cache.org/docs/3.0/reference/vcl.html#acls

Answer 2

You can use robots.txt to stop crawlers:

User-agent: BadCrawler
Disallow: /

This solution works only if the crawler follows the robots.txt specification.

Slow down rogue web scrapers on my website while still using Varnish
