How can I prevent bots and spiders from consuming my server resources?

Date: 2018-06-29 21:40:37

Tags: .htaccess web-crawler bots

My PHP script counts unique visitors. Compared to Google Analytics, the number is absurd: 30,000 per day, while Google Analytics counts 2,000. Since 2,000 is the correct figure, I added a condition to my script to avoid counting bots and spiders.

I also made it log the bots it identifies; in less than a minute I had more than 100 of them. Memory is limited, the bots are consuming resources, and I want to avoid that. My robots.txt:

# Allow Google, Yahoo and Bing to crawl everything except /admin/
User-agent: Googlebot 
User-agent: Yahoo! Slurp
User-agent: msnbot 
Disallow: /admin/ 
Disallow: /analitics/
Disallow: /class/
Allow: / 

# Disallow all other to crawl everywhere
User-agent: * 
Disallow: / 

Is there a way to prevent so many requests? I don't mind Google's or Bing's crawlers, but this is ridiculous. A sample:

es ip:40.77.167.161 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.178 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.178 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.178 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.140 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.177 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.191 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.178 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

1 Answer:

Answer 0 (score: 0)

First, we should distinguish between bots that announce themselves correctly and bots that claim to be someone else. There are many bots that pose as other bots, for various reasons (not only bad ones). Among the bots that announce themselves as someone else, there is a subgroup that also ignores robots.txt.

In your case, it looks like we are dealing with an actual bot from Microsoft.

Bots that follow robots.txt

Indeed, these IPs appear to be genuine Bing addresses. Microsoft explains that to find out whether an IP address actually belongs to them, we should use nslookup.

This gives us:

$ nslookup 40.77.167.87
87.167.77.40.in-addr.arpa   name = msnbot-40-77-167-87.search.msn.com.

$ nslookup msnbot-40-77-167-87.search.msn.com
[...]
Non-authoritative answer:
Name:   msnbot-40-77-167-87.search.msn.com
Address: 40.77.167.87
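The same two-step check (reverse lookup of the IP, then forward lookup of the resulting hostname) can be automated in the counting script. Here is a minimal sketch in Python; the suffix whitelist is my assumption based on the `search.msn.com` hostnames seen above:

```python
import socket

# Hostname suffixes Microsoft uses for its crawler hosts (assumed list,
# based on the nslookup output above).
BING_SUFFIXES = (".search.msn.com",)

def hostname_is_bing(hostname):
    """Pure string check: does the reverse-DNS name end in a Bing suffix?"""
    return hostname.rstrip(".").endswith(BING_SUFFIXES)

def ip_is_bing(ip):
    """Full verification: reverse lookup, suffix check, then the hostname
    must resolve forward to the same IP (prevents spoofed PTR records)."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)          # reverse DNS
        if not hostname_is_bing(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]  # forward DNS
    except (socket.herror, socket.gaierror):
        return False
```

The forward lookup matters: anyone can set a PTR record claiming to be `search.msn.com`, but only Microsoft can make that hostname resolve back to their IP.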

Microsoft does not give an explicit example of the name its bot will listen to in the robots.txt file, but I assume they expect bingbot rather than msnbot.

  Note

  Bingbot, after finding a set of instructions specific to it, will ignore the directives listed in the generic section, so you will need to repeat all the generic directives in addition to the specific ones you create for it, in its own section of the file.

  [...]

  In your robots.txt file, bots are referenced as user-agents. [...]

(emphasis mine)

On the other hand, Microsoft links to the Robots Database for a list of valid robot names, and that list does not include Bingbot at all (it only lists the msnbot you are already using).

Still, I would try adding bingbot to the user-agents in your robots.txt file and see whether it helps.
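Adding bingbot to the existing group could look like this (the extra User-agent line is my assumption about the name Bingbot listens to; the rest is your file unchanged):

```
# Allow Google, Yahoo and Bing to crawl everything except the listed paths
User-agent: Googlebot
User-agent: Yahoo! Slurp
User-agent: msnbot
User-agent: bingbot
Disallow: /admin/
Disallow: /analitics/
Disallow: /class/
Allow: /

# Disallow all others everywhere
User-agent: *
Disallow: /
```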

You did not include the actual path requested with each request. There also seem to be cases where Microsoft's Bingbot cannot be blocked with robots.txt.

Bots that don't follow robots.txt

For bots that don't follow robots.txt, your only option is server-side detection. You can block them based on their IP address (if the requests always come from the same IPs) or based on their user agent (if they always announce the same user agent).

For example, some sites block the user agent scrapy server-side (returning an empty page, a 404, or similar), because that is the default user agent of a popular web-scraping framework.
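Since the question is tagged .htaccess, one way to sketch such a user-agent block with Apache (assuming mod_rewrite is available; the pattern is just an example) is:

```
# Return 403 Forbidden for any request whose User-Agent contains "scrapy"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} scrapy [NC]
RewriteRule .* - [F,L]
```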

You could also implement IP-based blocking automatically, e.g., if you see more than k requests within x hours, block that IP for the next 10*x hours. This can of course produce false positives if the IP belongs to a consumer ISP, since ISPs often hand the same IP address to different users over time; that means you might block regular users. But with 2,000 visitors per day, I would say the risk that two of your visitors share an IP address and get blocked for making too many requests is low.
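That threshold scheme can be sketched as a sliding-window counter keyed by IP. Your site is PHP, but the idea is language-independent; here it is in Python, with the thresholds (k, x, and the 10x block factor) as hypothetical values you would tune:

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds: more than MAX_REQUESTS within WINDOW seconds
# blocks the IP for BLOCK_FACTOR * WINDOW seconds.
WINDOW = 3600           # x = 1 hour
MAX_REQUESTS = 1000     # k
BLOCK_FACTOR = 10

_requests = defaultdict(deque)  # ip -> timestamps of recent requests
_blocked_until = {}             # ip -> time at which the block expires

def is_blocked(ip, now=None):
    """Record one request from `ip` and report whether it should be refused."""
    now = time.time() if now is None else now
    if _blocked_until.get(ip, 0) > now:
        return True
    times = _requests[ip]
    times.append(now)
    # Drop timestamps that fell out of the sliding window.
    while times and times[0] <= now - WINDOW:
        times.popleft()
    if len(times) > MAX_REQUESTS:
        _blocked_until[ip] = now + BLOCK_FACTOR * WINDOW
        times.clear()
        return True
    return False
```

In a real deployment you would keep the counters in shared storage (a database, APCu, Redis) rather than in-process memory, so all PHP workers see the same counts.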