Out of curiosity, I tried to fetch and parse the HTML of a page:
$url = "http://www.continente.pt/stores/continente/pt-pt/public/Pages/subcategory.aspx?cat=Bebidas_Vinhos";
// Identify as Googlebot through the User-Agent header
$agent = 'Googlebot-Image/1.0 Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
$ch = curl_init();
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
// Return the response body as a string instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_URL, $url);
$result = curl_exec($ch);
var_dump($result);
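The supermarket's website responds with this message:
Error. This page cannot be displayed. Contact support for more information. The incident ID is: N/A.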
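I find this odd. They evidently have some protection against this type of "attack", but how do they protect the site while still letting Googlebot crawl it for digital marketing purposes?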
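Answer 0 (score: 0)
Try sending session cookies with the request, as in the command below. Even so, the page itself has no content, because the content is loaded asynchronously with AJAX.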
curl 'http://www.continente.pt/stores/continente/pt-pt/public/Pages/subcategory.aspx?cat=Bebidas_Vinhos' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3' -H 'Connection: keep-alive' -H 'Cookie: searchRefiner=%7B%22%22%3A%7B%221449049672079%22%3A%5B%5D%7D%7D; SPSessionGuid=ec3e4a3e-7cfe-4c8a-902f-1c64ba0868f4; __CommerceAnonymousShopper_ef77e72d-62b9-4b0f-8113-d111c9d6d7ce_Internet=0244rfNRN5rPgC7kvXzyqrNQg==WBGr/AUg99sKnXpF3QH4Sa5cHPFred5bJqPiwbFvDnL1jHUk6v0Jb0dpOZLY66bXpC8faWF7k5aOMi/qIkOgA4RNWuskMnicr6OJ12BBs8ns68kXmckzTJvkVEfDQB7DApeN5ULier028VPSLkChmWvBHyCHno328U6SrLu65m5e3lu521PF940napZPZIvN7hP51Yfi9c+FkwjIAZ+j8w==; MSCSProfile=287001FD2674671C70ED37E496ED003312D0DA42BDDB218BA1D2B71AD462488CF83AD1F7530553A13FDD4C8DB0E26123D3A02CCFBA6DAE49B72A185609583B9617878CEA5D73023FE7A74384436D54761511ED87FFA2AF58124E143C0E90DC9C72D55A51B3AE6EAB71153682F607FE3C29538E729117E4DD3D6B05C06E7FBA47; cPrompt_useCookies=1; cpup=2; _ga=GA1.2.532033017.1449049672; _dc_gtm_UA-158387-26=1; byside_webcare_tuid=5110f1jvvitrsyi82c2q4kddcxlrl0vdwfmrmtzeah679ditkl; __atuvc=1%7C48; __atuvs=565ebe4c6d710bda000; CampaignHistory=146148' -H 'Host: www.continente.pt' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0'
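The same idea can be sketched in PHP, the language used in the question. Rather than pasting cookie values copied from a browser, this version lets cURL collect and resend session cookies through a cookie jar file. The function names `build_session_options` and `fetch_with_session` and the cookie file path are hypothetical, and nothing here confirms which cookies the site actually checks, so treat it as a starting point rather than a verified fix.

```php
<?php
// Minimal sketch: persist session cookies between requests with a cookie jar.
// Function names and the cookie file path are illustrative assumptions.

function build_session_options(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects that may set cookies
        CURLOPT_COOKIEJAR      => $cookieFile, // file where received cookies are saved
        CURLOPT_COOKIEFILE     => $cookieFile, // file whose cookies are sent with the request
        CURLOPT_ENCODING       => '',          // accept every encoding cURL supports
        // Browser-like headers taken from the curl command in the answer
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0',
        CURLOPT_HTTPHEADER     => [
            'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language: es-ES,es;q=0.8,en-US;q=0.5,en;q=0.3',
        ],
    ];
}

function fetch_with_session(string $url, string $cookieFile): string
{
    $ch = curl_init();
    curl_setopt_array($ch, build_session_options($url, $cookieFile));
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    // The cookie jar is written when the handle is released, which happens
    // once $ch goes out of scope at the end of this function.
    return $body;
}
```

A possible usage is to call `fetch_with_session($url, '/tmp/continente_cookies.txt')` twice: the first call lets the server set its session cookies, and the second sends them back. As the answer notes, the returned HTML may still lack the product list, since that part is loaded by AJAX after the page renders.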