How can I prevent bots and spiders from consuming my server resources?

Date: 2018-06-29 21:40:37

Tags: .htaccess web-crawler bots

My PHP script counts unique visitors. Compared to Google Analytics, the number is absurd: 30,000 per day, while Google Analytics counts 2,000. Since 2,000 is the correct figure, I added a condition to my script to avoid counting bots and spiders.

I also made it log the bots it identifies; in less than a minute I had more than 100 of them. Memory is limited, the bots are consuming resources, and I want to avoid that. My robots.txt:

# Allow Google, Yahoo and Bing to crawl everything except /admin/
User-agent: Googlebot 
User-agent: Yahoo! Slurp
User-agent: msnbot 
Disallow: /admin/ 
Disallow: /analitics/
Disallow: /class/
Allow: / 

# Disallow all other to crawl everywhere
User-agent: * 
Disallow: / 

Is there a way to prevent so many requests? I don't mind Google's or Bing's crawlers, but this is ridiculous. A sample:

es ip:40.77.167.161 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.178 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.178 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.178 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.140 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.177 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.191 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.178 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

1 Answer:

Answer 0 (score: 0)

First, we should distinguish between bots that announce themselves correctly and bots that claim to be someone else. There are many bots that pose as other bots, for various reasons (not only bad ones). Among the bots that announce themselves as someone else, there is a subgroup that also ignores robots.txt.

In your case, it looks like we are dealing with an actual bot from Microsoft.

Bots that follow robots.txt

Indeed, these IPs appear to be genuine Bing addresses. Microsoft explains that to find out whether an IP address actually belongs to them, we should use nslookup.

This gives us:

$ nslookup 40.77.167.87
87.167.77.40.in-addr.arpa   name = msnbot-40-77-167-87.search.msn.com.

$ nslookup msnbot-40-77-167-87.search.msn.com
[...]
Non-authoritative answer:
Name:   msnbot-40-77-167-87.search.msn.com
Address: 40.77.167.87
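The same two-step check (reverse lookup of the IP, then forward lookup of the resulting hostname) can be automated in the counting script. Here is a minimal sketch in Python; the suffix whitelist is my assumption based on the `search.msn.com` hostnames seen above:

```python
import socket

# Hostname suffixes Microsoft uses for its crawler hosts (assumed list,
# based on the nslookup output above).
BING_SUFFIXES = (".search.msn.com",)

def hostname_is_bing(hostname):
    """Pure string check: does the reverse-DNS name end in a Bing suffix?"""
    return hostname.rstrip(".").endswith(BING_SUFFIXES)

def ip_is_bing(ip):
    """Full verification: reverse lookup, suffix check, then the hostname
    must resolve forward to the same IP (prevents spoofed PTR records)."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)          # reverse DNS
        if not hostname_is_bing(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]  # forward DNS
    except (socket.herror, socket.gaierror):
        return False
```

The forward lookup matters: anyone can set a PTR record claiming to be `search.msn.com`, but only Microsoft can make that hostname resolve back to their IP.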

Microsoft does not give an explicit example of the name its bot will listen to in the robots.txt file, but I assume they expect bingbot rather than msnbot.

  Note

  Bingbot, after finding a set of instructions specific to it, will ignore the directives listed in the generic section, so you will need to repeat all the generic directives in addition to the specific ones you create for it, in its own section of the file.

  [...]

  In your robots.txt file, bots are referenced as user-agents. [...]

(emphasis mine)

On the other hand, Microsoft links to the Robots Database for a list of valid robot names, and that list does not include Bingbot at all (it only lists the msnbot you are already using).

Still, I would try adding bingbot to the user-agents in your robots.txt file and see whether it helps.
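Adding bingbot to the existing group could look like this (the extra User-agent line is my assumption about the name Bingbot listens to; the rest is your file unchanged):

```
# Allow Google, Yahoo and Bing to crawl everything except the listed paths
User-agent: Googlebot
User-agent: Yahoo! Slurp
User-agent: msnbot
User-agent: bingbot
Disallow: /admin/
Disallow: /analitics/
Disallow: /class/
Allow: /

# Disallow all others everywhere
User-agent: *
Disallow: /
```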

You did not include the actual path requested with each request. There also seem to be cases where Microsoft's Bingbot cannot be blocked with robots.txt.

Bots that don't follow robots.txt

For bots that don't follow robots.txt, your only option is server-side detection. You can block them based on their IP address (if the requests always come from the same IPs) or based on their user agent (if they always announce the same user agent).

For example, some sites block the user agent scrapy server-side (returning an empty page, a 404, or similar), because that is the default user agent of a popular web-scraping framework.
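Since the question is tagged .htaccess, one way to sketch such a user-agent block with Apache (assuming mod_rewrite is available; the pattern is just an example) is:

```
# Return 403 Forbidden for any request whose User-Agent contains "scrapy"
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} scrapy [NC]
RewriteRule .* - [F,L]
```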

You could also implement IP-based blocking automatically, e.g., if you see more than k requests within x hours, block that IP for the next 10*x hours. This can of course produce false positives if the IP belongs to a consumer ISP, since ISPs often hand the same IP address to different users over time; that means you might block regular users. But with 2,000 visitors per day, I would say the risk that two of your visitors share an IP address and get blocked for making too many requests is low.
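That threshold scheme can be sketched as a sliding-window counter keyed by IP. Your site is PHP, but the idea is language-independent; here it is in Python, with the thresholds (k, x, and the 10x block factor) as hypothetical values you would tune:

```python
import time
from collections import defaultdict, deque

# Hypothetical thresholds: more than MAX_REQUESTS within WINDOW seconds
# blocks the IP for BLOCK_FACTOR * WINDOW seconds.
WINDOW = 3600           # x = 1 hour
MAX_REQUESTS = 1000     # k
BLOCK_FACTOR = 10

_requests = defaultdict(deque)  # ip -> timestamps of recent requests
_blocked_until = {}             # ip -> time at which the block expires

def is_blocked(ip, now=None):
    """Record one request from `ip` and report whether it should be refused."""
    now = time.time() if now is None else now
    if _blocked_until.get(ip, 0) > now:
        return True
    times = _requests[ip]
    times.append(now)
    # Drop timestamps that fell out of the sliding window.
    while times and times[0] <= now - WINDOW:
        times.popleft()
    if len(times) > MAX_REQUESTS:
        _blocked_until[ip] = now + BLOCK_FACTOR * WINDOW
        times.clear()
        return True
    return False
```

In a real deployment you would keep the counters in shared storage (a database, APCu, Redis) rather than in-process memory, so all PHP workers see the same counts.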