Detecting search crawlers with JavaScript

Asked: 2013-11-19 23:37:06

Tags: javascript web-crawler bots

How can I detect search crawlers? I ask because I want to suppress certain JavaScript calls when the user agent is a bot.

I found an example of how to detect a particular browser, but I can't find an example of how to detect a search crawler:

/MSIE (\d+\.\d+);/.test(navigator.userAgent); //test for MSIE x.x

Examples of search crawlers I would like to block:

Google 
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 
Googlebot/2.1 (+http://www.googlebot.com/bot.html) 
Googlebot/2.1 (+http://www.google.com/bot.html) 

Baidu 
Baiduspider+(+http://www.baidu.com/search/spider_jp.html) 
Baiduspider+(+http://www.baidu.com/search/spider.htm) 
BaiDuSpider 

9 Answers:

Answer 0 (score: 36)

This is the regex the Ruby UA library agent_orange uses to test whether a userAgent looks like a bot. You can narrow it down to specific bots by referring to the bot userAgent list here:
/bot|crawler|spider|crawling/i

For example, if you have some object util.browser where you store what kind of device the user is on:

util.browser = {
   bot: /bot|googlebot|crawler|spider|robot|crawling/i.test(navigator.userAgent),
   mobile: ...,
   desktop: ...
}
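A minimal sketch of how that check could gate behavior, as the question asks. The isBot helper and the runAnalytics name are illustrative, not from the answer; wrapping the regex in a function also makes it testable against any user-agent string:

```javascript
// Sketch: reuse the answer's regex as a standalone helper so it can be
// tested against arbitrary user-agent strings, not just navigator.userAgent.
function isBot(userAgent) {
  return /bot|googlebot|crawler|spider|robot|crawling/i.test(userAgent || '');
}

// In the browser you might then skip certain calls for crawlers, e.g.:
//   if (!isBot(navigator.userAgent)) { runAnalytics(); }  // hypothetical call
```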

Answer 1 (score: 12)

Try this. It is based on the list of crawlers available at https://github.com/monperrus/crawler-user-agents:
var botPattern = "(googlebot\/|Googlebot-Mobile|Googlebot-Image|Google favicon|Mediapartners-Google|bingbot|slurp|java|wget|curl|Commons-HttpClient|Python-urllib|libwww|httpunit|nutch|phpcrawl|msnbot|jyxobot|FAST-WebCrawler|FAST Enterprise Crawler|biglotron|teoma|convera|seekbot|gigablast|exabot|ngbot|ia_archiver|GingerCrawler|webmon |httrack|webcrawler|grub.org|UsineNouvelleCrawler|antibot|netresearchserver|speedy|fluffy|bibnum.bnf|findlink|msrbot|panscient|yacybot|AISearchBot|IOI|ips-agent|tagoobot|MJ12bot|dotbot|woriobot|yanga|buzzbot|mlbot|yandexbot|purebot|Linguee Bot|Voyager|CyberPatrol|voilabot|baiduspider|citeseerxbot|spbot|twengabot|postrank|turnitinbot|scribdbot|page2rss|sitebot|linkdex|Adidxbot|blekkobot|ezooms|dotbot|Mail.RU_Bot|discobot|heritrix|findthatfile|europarchive.org|NerdByNature.Bot|sistrix crawler|ahrefsbot|Aboundex|domaincrawler|wbsearchbot|summify|ccbot|edisterbot|seznambot|ec2linkfinder|gslfbot|aihitbot|intelium_bot|facebookexternalhit|yeti|RetrevoPageAnalyzer|lb-spider|sogou|lssbot|careerbot|wotbox|wocbot|ichiro|DuckDuckBot|lssrocketcrawler|drupact|webcompanycrawler|acoonbot|openindexspider|gnam gnam spider|web-archive-net.com.bot|backlinkcrawler|coccoc|integromedb|content crawler spider|toplistbot|seokicks-robot|it2media-domain-crawler|ip-web-crawler.com|siteexplorer.info|elisabot|proximic|changedetection|blexbot|arabot|WeSEE:Search|niki-bot|CrystalSemanticsBot|rogerbot|360Spider|psbot|InterfaxScanBot|Lipperhey SEO Service|CC Metadata Scaper|g00g1e.net|GrapeshotCrawler|urlappendbot|brainobot|fr-crawler|binlar|SimpleCrawler|Livelapbot|Twitterbot|cXensebot|smtbot|bnf.fr_bot|A6-Indexer|ADmantX|Facebot|Twitterbot|OrangeBot|memorybot|AdvBot|MegaIndex|SemanticScholarBot|ltx71|nerdybot|xovibot|BUbiNG|Qwantify|archive.org_bot|Applebot|TweetmemeBot|crawler4j|findxbot|SemrushBot|yoozBot|lipperhey|y!j-asr|Domain Re-Animator Bot|AddThis)";
var re = new RegExp(botPattern, 'i');
var userAgent = 'Googlebot/2.1 (+http://www.googlebot.com/bot.html)';
if (re.test(userAgent)) {
    console.log('the user agent is a crawler!');
}

Answer 2 (score: 11)

According to this post, the following regex will match the biggest search engines.

/bot|google|baidu|bing|msn|duckduckbot|teoma|slurp|yandex/i
    .test(navigator.userAgent)

The search engines matched are:

  • Baidu
  • Bingbot / MSN
  • DuckDuckGo
  • Google
  • Teoma
  • Yahoo!
  • Yandex

Additionally, I have added bot to catch smaller crawlers/bots.
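As a small usage sketch, the same regex can be wrapped in a named helper (isSearchEngine is an illustrative name, not from the answer):

```javascript
// Sketch using the answer's regex; matches the engines listed above.
function isSearchEngine(userAgent) {
  return /bot|google|baidu|bing|msn|duckduckbot|teoma|slurp|yandex/i.test(userAgent || '');
}
```

Note that such broad substrings can false-positive (e.g. "bot" also matches some device names like "Cubot"), so it is worth checking the pattern against your real traffic.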

Answer 3 (score: 1)

This may help detect robot user agents while also keeping things more organized:

JavaScript

const detectRobot = (userAgent) => {
  const robots = new RegExp([
    /bot/,/spider/,/crawl/,                            // GENERAL TERMS
    /APIs-Google/,/AdsBot/,/Googlebot/,                // GOOGLE ROBOTS
    /mediapartners/,/Google Favicon/,
    /FeedFetcher/,/Google-Read-Aloud/,
    /DuplexWeb-Google/,/googleweblight/,
    /bing/,/yandex/,/baidu/,/duckduck/,/yahoo/,        // OTHER ENGINES
    /ecosia/,/ia_archiver/,
    /facebook/,/instagram/,/pinterest/,/reddit/,       // SOCIAL MEDIA
    /slack/,/twitter/,/whatsapp/,/youtube/,
    /semrush/,                                         // OTHER
  ].map((r) => r.source).join("|"),"i");               // BUILD REGEXP + "i" FLAG

  return robots.test(userAgent);
};

TypeScript

const detectRobot = (userAgent: string): boolean => {
  const robots = new RegExp(([
    /bot/,/spider/,/crawl/,                               // GENERAL TERMS
    /APIs-Google/,/AdsBot/,/Googlebot/,                   // GOOGLE ROBOTS
    /mediapartners/,/Google Favicon/,
    /FeedFetcher/,/Google-Read-Aloud/,
    /DuplexWeb-Google/,/googleweblight/,
    /bing/,/yandex/,/baidu/,/duckduck/,/yahoo/,           // OTHER ENGINES
    /ecosia/,/ia_archiver/,
    /facebook/,/instagram/,/pinterest/,/reddit/,          // SOCIAL MEDIA
    /slack/,/twitter/,/whatsapp/,/youtube/,
    /semrush/,                                            // OTHER
  ] as RegExp[]).map((r) => r.source).join("|"),"i");     // BUILD REGEXP + "i" FLAG

  return robots.test(userAgent);
};

Usage on the server:

const userAgent = req.get('user-agent');
const isRobot = detectRobot(userAgent);

Usage on the "client" / in some phantom browser a bot might be using:

const userAgent = navigator.userAgent;
const isRobot = detectRobot(userAgent);

Overview of Google crawlers:

https://developers.google.com/search/docs/advanced/crawling/overview-google-crawlers

Answer 4 (score: 0)

The "MSIE x.x test" example is just code that tests the userAgent against a regexp. In your example, the regexp is the

/MSIE (\d+\.\d+);/

part. Just replace it with the regexp you want to test the user agent against. That would be something like

/Google|Baidu|Baiduspider/.test(navigator.userAgent)

where the pipe is the "or" operator, so the user agent is matched against all of the mentioned bots. For more information about regular expressions you can refer to this site, since JavaScript uses Perl-style regexps.

Answer 5 (score: 0)

The isTrusted property could help you.

The isTrusted read-only property of the Event interface is a Boolean that is true when the event was generated by a user action, and false when the event was created or modified by a script or dispatched via EventTarget.dispatchEvent().

For example:

function isCrawler(event) {
  // isTrusted is true for real user actions, so an untrusted
  // event suggests a script (possibly a bot) dispatched it.
  return !event.isTrusted;
}

⚠ Note that IE does not support it.

More info in the docs: https://developer.mozilla.org/en-US/docs/Web/API/Event/isTrusted
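As a sketch, the check can take an event-like object so the logic is testable outside a browser (cameFromScript is an illustrative name; crawlers that never dispatch events won't trigger this at all, so treat it as a heuristic):

```javascript
// Sketch: flag events dispatched by script rather than by a real
// user action, per the isTrusted semantics quoted above.
function cameFromScript(event) {
  return event.isTrusted === false;
}

// Browser usage (illustrative):
//   button.addEventListener('click', (e) => {
//     if (cameFromScript(e)) return; // ignore synthetic clicks
//     // ... real-user code path ...
//   });
```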

Answer 6 (score: 0)

I combined some of the above and removed some redundancy. I use this in .htaccess on a semi-private site:

(google|bot|crawl|spider|slurp|baidu|bing|msn|teoma|yandex|java|wget|curl|Commons-HttpClient|Python-urllib|libwww|httpunit|nutch|biglotron|convera|gigablast|archive|webmon|httrack|grub|netresearchserver|speedy|fluffy|bibnum|findlink|panscient|IOI|ips-agent|yanga|Voyager|CyberPatrol|postrank|page2rss|linkdex|ezooms|heritrix|findthatfile|Aboundex|summify|ec2linkfinder|facebook|slack|instagram|pinterest|reddit|twitter|whatsapp|yeti|RetrevoPageAnalyzer|sogou|wotbox|ichiro|drupact|coccoc|integromedb|siteexplorer|proximic|changedetection|WeSEE|scrape|scaper|g00g1e|binlar|indexer|MegaIndex|ltx71|BUbiNG|Qwantify|lipperhey|y!j-asr|AddThis)
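For context, a pattern like that is typically applied in .htaccess by tagging matching user agents and denying them. This is a hedged sketch using Apache 2.4 syntax (the shortened pattern and the is_bot variable name are illustrative, not the answerer's exact file):

```apache
# Sketch: mark matching user agents, then deny them access.
SetEnvIfNoCase User-Agent "(google|bot|crawl|spider|slurp|baidu|bing)" is_bot
<RequireAll>
    Require all granted
    Require not env is_bot
</RequireAll>
```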

Answer 7 (score: 0)

You may want to look at the new navigator.webdriver property, which allows bots to inform you that they are bots:

https://developer.mozilla.org/en-US/docs/Web/API/Navigator/webdriver

The webdriver read-only property of the Navigator interface indicates whether the user agent is controlled by automation.

It defines a standard way for co-operating user agents to inform the document that it is controlled by WebDriver, for example, so that alternate code paths can be triggered during automation.

It is supported by all major browsers and respected by major browser automation software like Puppeteer. Users of automation software can of course disable it, so it should only be used to detect "good" bots.
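A minimal sketch of such a check (the isAutomated name is illustrative; it accepts the navigator object as a parameter so it can be exercised outside a browser):

```javascript
// Sketch: navigator.webdriver is true when the browser is driven by
// automation tooling (e.g. Selenium or Puppeteer with default settings).
function isAutomated(nav) {
  return nav.webdriver === true;
}

// Browser usage (illustrative):
//   if (isAutomated(navigator)) { /* skip analytics, ads, etc. */ }
```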

Answer 8 (score: -1)

You have disclosed what you allow as user agents... But in general, crawlers use all sorts of ACCEPTED user agents, and in my experience you cannot restrict crawlers without also affecting real users.