我正在尝试使用具有代理旋转功能的CURL模拟一个真实的浏览器请求,我对此进行了搜索,但是没有一个答案有效。
代码如下:
$url= 'https://www.stubhub.com/';
$proxy = '1.10.185.133:30207';
$userAgent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36';
$curl = curl_init();
curl_setopt( $curl, CURLOPT_URL, trim($url) );
curl_setopt($curl, CURLOPT_REFERER, trim($url));
curl_setopt( $curl, CURLOPT_RETURNTRANSFER, TRUE );
curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, TRUE );
curl_setopt( $curl, CURLOPT_CONNECTTIMEOUT, 0 );
curl_setopt( $curl, CURLOPT_TIMEOUT, 0 );
curl_setopt( $curl, CURLOPT_AUTOREFERER, TRUE );
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, FALSE);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
$cacert='C:/xampp/htdocs/cacert.pem';
curl_setopt( $curl, CURLOPT_CAINFO, $cacert );
curl_setopt($curl, CURLOPT_COOKIEFILE,__DIR__."/cookies.txt");
curl_setopt ($curl, CURLOPT_COOKIEJAR, dirname(__FILE__) . '/cookies.txt');
curl_setopt($curl, CURLOPT_MAXREDIRS, 5);
curl_setopt( $curl, CURLOPT_USERAGENT, $userAgent );
//Headers
$header = array();
$header[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
$header[] = "Accept-Language: cs,en-US;q=0.7,en;q=0.3";
$header[] = "Accept-Encoding: utf-8";
$header[] = "Connection: keep-alive";
$header[] = "Host: www.gumtree.com";
$header[] = "Origin: https://www.stubhub.com";
$header[] = "Referer: https://www.stubhub.com";
curl_setopt( $curl, CURLOPT_HEADER, $header );
curl_setopt($curl, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);
curl_setopt($curl, CURLOPT_HTTPPROXYTUNNEL, TRUE);
curl_setopt($curl, CURLOPT_PROXY, $proxy);
curl_setopt($curl, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
$data = curl_exec( $curl );
$info = curl_getinfo( $curl );
$error = curl_error( $curl );
echo '<pre>';
print_r($all);
echo '</pre>';
这是我运行脚本时得到的:
Array
(
[data] => HTTP/1.1 200 OK
HTTP/1.0 405 Method Not Allowed
Server: nginx
Content-Type: text/html; charset=UTF-8
Accept-Ranges: bytes
Expires: Thu, 01 Jan 1970 00:00:01 GMT
Cache-Control: private, no-cache, no-store, must-revalidate
Surrogate-Control: no-store, bypass-cache
Content-Length: 9411
X-EdgeConnect-MidMile-RTT: 203
X-EdgeConnect-Origin-MEX-Latency: 24
Date: Sat, 03 Nov 2018 17:15:56 GMT
Connection: close
Strict-Transport-Security: max-age=31536000; includeSubDomains
[info] => Array
(
[url] => https://www.stubhub.com/
[content_type] => text/html; charset=UTF-8
[http_code] => 405
[header_size] => 487
[request_size] => 608
[filetime] => -1
[ssl_verify_result] => 0
[redirect_count] => 0
[total_time] => 38.484
[namelookup_time] => 0
[connect_time] => 2.219
[pretransfer_time] => 17.062
[size_upload] => 0
[size_download] => 9411
[speed_download] => 244
[speed_upload] => 0
[download_content_length] => 9411
[upload_content_length] => -1
[starttransfer_time] => 23.859
[redirect_time] => 0
[redirect_url] =>
[primary_ip] => 1.10.186.132
[certinfo] => Array
(
)
[primary_port] => 42150
[local_ip] => 192.168.1.25
[local_port] => 59320
)
[error] =>
)
它和Recaptcha一样:
Due to high volume of activity from your computer, our anti-robot software has blocked your access to stubhub.com. Please solve the puzzle below and you will immediately regain access.
当我使用任何浏览器访问该网站时,都会显示该网站。
但是使用上面的脚本,不是。
那么我想让curl请求像真正的浏览器请求一样被我发现吗?
或者如果有API /库可以做到这一点,请提及。
Guzzle或类似产品会解决此问题吗?
答案 0 :(得分:0)
“那么,使curl请求像真正的浏览器请求一样,我缺少什么?”
我的猜测是他们正在使用简单的Cookie检查。 有更高级的方法,可以高度自动化地识别诸如cURL之类的自动化,尤其是与代理IP地址列表或已知Banger的IP结合使用时。
您的第一步是使用pcap或类似方法拦截浏览器发出的请求,然后尝试使用cURL复制该请求。
要检查的另一件简单的事情是您的饼干罐是否已经播种了一些种子。我通常也会这样做,因为Internet上的大多数脚本都是复制粘贴的,并且对这些细节不太关注。
可以肯定会让您从我的任何系统跳出的东西是您正在发送引荐来源,但您似乎并未真正连接到首页。您实际上是在第一次见到您的服务器上说“再次相遇”。您可能从第一次遇到时就保存了一个cookie,并且该cookie现在已经因其他操作而无效(实际上被标记为“邪恶”)。至少在开始时,总是从干净的位置复制访问顺序。
您可以尝试改编this answer,也基于cURL。 始终使用MitM SSL解码代理验证实际流量。
现在,真实答案-您需要什么信息 ?你能把它拿到别的地方吗?您能否明确要求,也许与源站点达成协议?