Question

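I'm having trouble detecting my user agent against a list of agents. Even though the agent is in the list and I run it through preg_match, I never get a positive match, no matter which bot I imitate. My current HTTP_USER_AGENT is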

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

My $str variable looks like this

/(abcdatos botlink|ariadne|aspider|atn worldwide|auresys|acme.spider|ahoy!|alkaline|alkalinebot|anthill|arachnophilia|arale|araneo|araybot|architextspider|aretha|ask jeeves|askjeeves|atomz|bspider|backrub|bay spider|bayspider|big brother|bjaaland|blackwidow|bloodhound|borg-bot|botlink|boxseabot|cactvs chemistry spider|cmc|calif|cassandra|checkbot|christcrawler.com|collective|combine system|computingsite robi|conceptbot|confuzzledbot|coolbot|cusco|cyberspyder|cydralspider|diibot|dnabot|dwcp|deweb|desert realm spider|die blinde kuh|dienstspider|digger|digimarc|digital integrity robot|direct hit grabber|download express|dragonbot|eit link verifier robot|elfinbot|esi|esirover|esismartspider|ebiness|emacs-w3|esther|evliya celebi|fdse|fastcrawler|felix ide|fetchrover|fish search|fluid dynamics robot|fouineur|freecrawl|funnelweb|gcreep|geneva|getbot|geturl|getterrobo-plus|getterroboplus puu|golem|googlebot|grapnel|griffon|gromit|gulliver|gulper|hi (html index) search|hku www octopus|htmlgobble|hambot|harvest|hometown spider pro|hulud|hyper-decontextualizer|hÃƒÂ¤mÃƒÂ¤hÃƒÂ¤kki|i, robot|ibm_planetwide|ingrid|ilse|imagelock|incywincy|infoseek robot 1.0|infospiders|informant|infoseek sidewinder|ingrid|inktomi slurp|inspector web|intelliagent|internet cruiser robot|internet shinchakubin|iron33|israeli-search|jbot|jbot java web robot|jcrawler|javabee|jeeves|jobo|jobo java web robot|jobot|joebot|jumpstation|kdd-explorer|kit-fireball|ko_yappo_robot|katipo|kilroy|lwp|labelgrab|labelgrabber|link validator|linkscan|linkscan server|linkscan workstation|linkwalker|linkidator|lockon|lycos|momspider|msnbot|msnbot|mac wwwworm|magpie|mattie|mediafox|merzscope|mindcrawler|monster|motor|mozilla 3.01 pbwf|mozilla|muncher|muninn|muscat ferret|muscatferret|mwd.search|mwdsearch|ndspider|nec-meshexplorer|nhse web forager|nederland.zoek|netcarta webmap engine|netmechanic|netscoop|nomad|northern light gulliver|objectssearch|occam|ontospider|open text index robot|openfind data gatherer|orb 
search|orbsearch|pgp key agent|pack rat|packrat|pageboy|parasite|patric|perlcrawler 1.0|perlcrawler|phantom|phpdig|piltdownman|pimptrain|pimptrain.com's robot|pioneer|plumtreewebaccessor|poppi|popular iconoclast|portal juice spider|portalb spider|portalbspider|portaljuice.com|puu|rbse spider|rhcs|raven|raven search|raven-v2|resume robot|rixbot|road runner: imagescape robot|road runner: the imagescape robot|roadhouse crawling system|robbie|robbie the robot|robocrawl|robocrawl spider|robofox|robofox v2.0|robot francoroute|robozilla|roverbot|rules|sg-scout|slcrawler|safetynet robot|scooter|search-au|search.aus-au.com|searchprocess|senrigan|shagseeker|shagseeker|shai|shai'hulud|sift|simbot|simmany robot ver1.0|site searcher|site valet|sitetech-rover|skymob.com|sleek|slurp|smart spider|snooper|solbot|spanner|speedy spider|spiderbot|spiderman|spiderman 1.0|spiderview(tm)|spiderline crawler|spry wizard robot|suke|sven|sygol|t-h-u-n-d-e-r-s-t-o-n-e|tach black widow|titan|tlspider|tarantula|tcl w3 robot|techbot|templeton|teoma|teomatechnologies|the jubii indexing robot|the nwi robot|the northstar robot|the peregrinator|the python robot|the tkwww robot|the web moose|the web wombat|the webfoot robot|the world wide web worm|titin|ucsd crawl|url check|url spider pro|udmsearch|ukonline|uptimebot|user-agent: mozilla|vwbot|vwbot_k|valkyrie|verticrawl|verticrawlbot|victoria|voyager|w3m2|wwwc|wwwc ver 0.2.5|walhello appie|wallpaper (alias crawlpaper)|web core |webbandit web spider|webbandit|webcatcher|webcopy|weblinker|webmechanic|webmirror|webmoose|webquest|webreaper|webspider|webstolperer|webvac|webwalker|webwatch|webzinger|webinator|weblog monitor|websnarf|wget|whowhere robot|wild ferret web hopper #1, #2, #3|wired digital|xget|xyleme robot|xavatoria|zilla"|awapclient|abcdatos|ahoy|ananzi|anthill|appie|arale|araneo|araybot|ariadne|arks|askjeeves|atn|atomz|auresys|bigbrother|bjaaland|blindekuh|borg-bot|boxseabot|bright.net caching 
robot|brightnet|bspider|cienciaficcion.net|cienciaficcion.net spider|calif|cassandra|cgireader|christcrawler|churl|cienciaficcion|cmc|combine|confuzzledbot|coolbot|cosmos|crawlpaper|cruiser|cusco|cyberspyder|cydralspider|desert realm|desertrealm|dienstspider|digger|diibot|directhit|dnabot|download_express|downloadexpress|dragonbot|dwcp|e-collector|ebiness|ecollector|elfinbot|esculapio|esther|evliyacelebi|fastcrawler|fetchrover|fido|fireball|fouineur|freecrawl|gammaspider|gammaspider, focusedcrawler|gazz|gcreep|gestalticonoclast|golem|googlebot|grabber|grapnel|griffon|gromit|gulliver|gulper|gulperbot|hambot|havindex|hometown|hotwired|ht:|htdig|html_analyzer|iajabot|iajabot|iconoclast|image.kapsi.net|imagelock|informant|infoseek|infospider|inspectorwww|irobot|javabee|jcrawler|jobo|kapsi|ko_yappo_robot|label-grabber|labelgrabber.txt|larbin|legs|linkidator|linkwalker|logo.gif|logo.gif crawler|logo_gif_crawler|magpie|marvin|mattie|mediafox|mnogosearch software|mnogosearch|moget|mouse.house|msnbot|muncher|muninn|muscatferret|myweb|netmechanic|netscoop|newscan-online|nil|nzexplorer|occam|orb_search|packrat|pageboy|parasite|patric|pegasus|perlcrawler|phpdig|piltdownman|pimptrain|pjspider|poppi|portalb|psbot|raven|rhcs|rixbot|roadrunner|robbie|robi|robocrawl|robofox|robozillaob o|rules|scooter|search-info|search_au|searchprocess|shaihulud|sharp-info-agent|sift|skymob|slurp|smartspider|snooper|solbot|speedy|spider_monkey|spiderbot|spiderline|spiderview|ssearcher|ssearcher100|straight flash!! 
getterroboplus 1.5|suke|suntek|sven|tach_bw|tarspider|techbot|templeton|the world wide web wanderer|titin|tlspider|topiclink|udmsearch|uptimebot|urlck|us|valkyrie|verticrawl|victoria|vision-search|void-bot|voidbot|voyager|vwbot|w3mir|w@pspider by wap4.com|w@pspider|wallpaper|wapspider|webcatcher|webfetcher|webinator|weblayers|webquest|webreader|webreaper|webs|webspider|webwalk|webwalker|wget|whatuseek winona|whatuseek_winona|whatuseek|whowhere|winona|wired-digital|wired-digital-newsbot|wlm|wlm-1.1|wolp|wwwc|wz101|xget)^$/

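It should (in theory) see googlebot in my $str variable, preg_match it against strtolower($_SERVER['HTTP_USER_AGENT']), account for the increase in $matches and send the header, but it never seems to. This is the code I'm currently working with; perhaps someone can shed some light on it for me?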

    //looking for this
    $query = 'klat-badge'; 

    //if not found, continue
    if(strpos($content, $query) === false) { 

        //require banlist
        require('botlist.php'); 

        //compact banlist
        $str = strtolower('/(' . implode('|', $list) .')^$/');
        $matches = array();

        //can we find a match in user agent versus banlist?
        $numMatches = preg_match($str, strtolower($_SERVER['HTTP_USER_AGENT']), $matches, 'i');

            if($numMatches > 0 || $_GET['botban'] == 'true') {

                //so tell bots we're broken
                header("Status: 503");
                header($_SERVER["SERVER_PROTOCOL"].' 503 Service Temporarily Unavailable');

                exit;

            }
    }

Answer 1

It won't find "Googlebot" in the User-Agent, because your list specifies |googlebot| and your regex has no /i case-insensitivity modifier.

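The ^$ at the end is definitely wrong too. Once the group has matched at least one character, ^ (start of string) can no longer succeed, so the pattern as a whole can never match.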
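To make the effect of the trailing anchors concrete, here is a minimal sketch. It assumes a hypothetical three-entry `$list` standing in for the real one from botlist.php, and compares the pattern as built in the question with the same pattern minus `^$`.

```php
<?php
// Hypothetical short list; the real $list comes from botlist.php.
$list = array('googlebot', 'msnbot', 'wget');

// Lowercased user agent, as in the question.
$ua = strtolower('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)');

// Pattern as built in the question: the group is followed by ^$.
$broken = '/(' . implode('|', $list) . ')^$/';

// Same pattern without the trailing anchors.
$fixed = '/(' . implode('|', $list) . ')/';

$brokenResult = preg_match($broken, $ua); // 0: ^ cannot hold after "googlebot" was consumed
$fixedResult  = preg_match($fixed, $ua);  // 1: "googlebot" is found inside the user agent

var_dump($brokenResult, $fixedResult);
```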

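Edit: I noticed that you do pass 'i' as a parameter to preg_match. That doesn't work. The flags parameter only accepts integers; they are meant for the PHP wrapper function and are not passed on to the PCRE regex library. Modifiers such as i belong after the closing delimiter of the pattern string itself.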
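Putting these points together, one possible way to build the pattern is sketched below. The helper name `isListedBot` is hypothetical, and the `preg_quote` step is an extra assumption on my part: several entries in the list ("ahoy!", "hi (html index) search", "acme.spider") contain regex metacharacters, so escaping them seems prudent.

```php
<?php
/**
 * Hypothetical helper: returns true when $userAgent contains any entry of $names.
 */
function isListedBot(array $names, $userAgent)
{
    // Escape regex metacharacters in each name, plus the "/" delimiter.
    $quoted = array();
    foreach ($names as $name) {
        $quoted[] = preg_quote($name, '/');
    }

    // No trailing ^$, and the "i" modifier sits after the closing delimiter,
    // so neither the list nor the user agent needs strtolower().
    $pattern = '/(' . implode('|', $quoted) . ')/i';

    return preg_match($pattern, $userAgent) === 1;
}

// Usage with a hypothetical two-entry list:
$isBot = isListedBot(
    array('googlebot', 'ahoy!'),
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
);
var_dump($isBot);
```

In the question's code this would replace the `$str` / `preg_match` lines, with `$list` and `$_SERVER['HTTP_USER_AGENT']` passed in as the two arguments.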
