My PHP script counts unique visitors. Compared to Google Analytics the number is absurd: 30,000 per day, while Google Analytics counts 2,000. 2,000 is the correct figure, so I added a condition to my script to avoid counting bots and spiders.
I also made it log the bots it recognizes; in less than a minute I had more than 100. Memory is limited and the bots are eating resources, which I want to avoid. My robots.txt:
# Allow Google, Yahoo and Bing to crawl everything except the directories below
User-agent: Googlebot
User-agent: Yahoo! Slurp
User-agent: msnbot
Disallow: /admin/
Disallow: /analitics/
Disallow: /class/
Allow: /
# Disallow all other to crawl everywhere
User-agent: *
Disallow: /
Is there a way to prevent so many requests? I don't mind Google's or Bing's crawlers, but this is ridiculous. A sample:
es ip:40.77.167.161 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.178 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.178 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.178 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.140 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.177 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.191 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
pt ip:40.77.167.178 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
es ip:40.77.167.87 pais:United States cidade:Boydton user agent:mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
Answer (score: 0):
First, we should distinguish between bots that announce themselves correctly and bots that claim to be someone else. There are many bots that impersonate other bots, for various reasons (not only bad ones). Among the bots that announce themselves as someone else, there is a subgroup that also ignores robots.txt.
In your case, however, it looks like we are dealing with an actual bot from Microsoft.
Indeed, those IPs appear to be genuine Bing addresses. Microsoft explains that to find out whether an IP address actually belongs to them, we should use nslookup.
This gives us:
$ nslookup 40.77.167.87
87.167.77.40.in-addr.arpa name = msnbot-40-77-167-87.search.msn.com.
$ nslookup msnbot-40-77-167-87.search.msn.com
[...]
Non-authoritative answer:
Name: msnbot-40-77-167-87.search.msn.com
Address: 40.77.167.87
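This two-step check (reverse lookup, then a forward lookup of the returned hostname) can be automated. A minimal Python sketch, assuming the `msnbot-*.search.msn.com` naming seen in the nslookup output above; the function names and the suffix constant are mine, not an official API:

```python
import socket

# Hostname suffix that verified Bing crawlers resolved to in the
# nslookup output above; treated here as an assumption.
BING_SUFFIX = '.search.msn.com'

def is_msn_hostname(host):
    """Pure string check: does the reverse-DNS name end in the Bing suffix?"""
    return host.endswith(BING_SUFFIX)

def verify_bingbot_ip(ip):
    """Forward-confirmed reverse DNS: resolve IP -> hostname, check the
    domain, then resolve hostname -> IPs and confirm the original IP
    is among them. Returns False on any lookup failure."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return False
    if not is_msn_hostname(host):
        return False
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except socket.gaierror:
        return False
```

The suffix check uses `endswith` with a leading dot, so a hostname like `search.msn.com.attacker.example` does not pass.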
Microsoft does not give an explicit example of which name their bot will listen to in the robots.txt file, but I assume they expect to see bingbot rather than msnbot.
Note
Bingbot, upon finding a specific set of instructions for itself, will ignore the directives listed in the generic section, so you will need to repeat all of the general directives in addition to the specific directives you created for them, in their own section of the file.
[...]
In robots.txt files, robots are referred to as user-agents. [...]
(highlighting by me)
On the other hand, Microsoft links to the Robots Database for a list of valid robot names, and that list does not mention Bingbot at all (only msnbot, which you already use).
Still, I would try adding bingbot to the user-agents in your robots.txt file and see whether that helps.
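The allowed section could then look like this (a sketch only; note that, per Microsoft's note quoted above, a named user-agent ignores the generic section, so each named section needs the full set of directives):

```
User-agent: bingbot
User-agent: Googlebot
User-agent: Yahoo! Slurp
User-agent: msnbot
Disallow: /admin/
Disallow: /analitics/
Disallow: /class/
Allow: /

# Disallow all others everywhere
User-agent: *
Disallow: /
```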
You did not include the actual requested path with each request. There seem to be cases where Microsoft's Bingbot cannot be blocked with robots.txt.
For bots that do not honor robots.txt, server-side detection is your only option. You can block them based on their IP (if you keep seeing requests from the same IP) or based on their user agent (if they always announce the same one).
For example, some sites block the user agent scrapy server-side (returning an empty page, a 404, or similar), because that is the default announced by a popular web-scraping framework.
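Such a user-agent filter can be as simple as a case-insensitive substring match against a deny list. A Python sketch; the token list below is a hypothetical example, not a recommendation:

```python
# Hypothetical deny list of user-agent substrings; extend as needed.
BLOCKED_UA_TOKENS = ('scrapy', 'python-requests', 'curl')

def is_blocked_user_agent(user_agent):
    """Case-insensitive substring match against the deny list.
    A missing user agent is not blocked here."""
    ua = (user_agent or '').lower()
    return any(token in ua for token in BLOCKED_UA_TOKENS)
```

On a match, the application would then return the empty page or 404 described above instead of the real content.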
You can also automate IP-based blocking, e.g. if you see more than k requests within x hours, block that IP for the next 10·x hours. This can of course produce false positives with consumer ISPs, which often hand the same IP address to different users over time, meaning you might block ordinary visitors. But with 2,000 visitors per day, I'd say the risk that two of your visitors share an IP address and one gets blocked for the other's requests is low.
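The k-requests-in-x-hours rule can be sketched in a few lines of Python. The thresholds are placeholders, and a real deployment would persist this state (e.g. in the same database the hit log already uses) rather than keep it in process memory:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600      # x = 1 hour
MAX_REQUESTS = 100         # k requests allowed per window
BLOCK_FACTOR = 10          # block for 10 * x hours

_hits = defaultdict(deque)   # ip -> request timestamps within the window
_blocked_until = {}          # ip -> unix time at which the block expires

def allow_request(ip, now=None):
    """Return False if the IP is currently blocked, or if this request
    pushes it over the per-window limit (which records a new block)."""
    now = time.time() if now is None else now
    if _blocked_until.get(ip, 0.0) > now:
        return False
    q = _hits[ip]
    q.append(now)
    # Drop timestamps that have aged out of the window.
    while q and q[0] <= now - WINDOW_SECONDS:
        q.popleft()
    if len(q) > MAX_REQUESTS:
        _blocked_until[ip] = now + BLOCK_FACTOR * WINDOW_SECONDS
        q.clear()
        return False
    return True
```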