How to use cURL to simulate a real request from a browser?

Date: 2018-11-03 19:34:18

Tags: php curl web-scraping proxy http-headers

I am trying to simulate a real browser request using cURL with rotating proxies. I searched around for this, but none of the answers worked.

Here is the code:

$url = 'https://www.stubhub.com/';
$proxy = '1.10.185.133:30207';
$userAgent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36';

$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, trim($url));
curl_setopt($curl, CURLOPT_REFERER, trim($url));
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 0); // 0 = wait indefinitely for the connection
curl_setopt($curl, CURLOPT_TIMEOUT, 0);        // 0 = no overall timeout
curl_setopt($curl, CURLOPT_AUTOREFERER, true);
curl_setopt($curl, CURLOPT_SSL_VERIFYHOST, false);
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false);
$cacert = 'C:/xampp/htdocs/cacert.pem';
curl_setopt($curl, CURLOPT_CAINFO, $cacert); // has no effect while peer verification is disabled above
curl_setopt($curl, CURLOPT_COOKIEFILE, __DIR__ . '/cookies.txt');
curl_setopt($curl, CURLOPT_COOKIEJAR, __DIR__ . '/cookies.txt');
curl_setopt($curl, CURLOPT_MAXREDIRS, 5);
curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
curl_setopt($curl, CURLOPT_ENCODING, ''); // offer all supported encodings and decompress automatically

// Headers (no manual Host header: cURL derives the correct one from the URL)
$header = array();
$header[] = "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
$header[] = "Accept-Language: cs,en-US;q=0.7,en;q=0.3";
$header[] = "Connection: keep-alive";
$header[] = "Origin: https://www.stubhub.com";
$header[] = "Referer: https://www.stubhub.com";

curl_setopt($curl, CURLOPT_HTTPHEADER, $header); // CURLOPT_HTTPHEADER sends the request headers
curl_setopt($curl, CURLOPT_HEADER, true);        // CURLOPT_HEADER only includes response headers in $data
curl_setopt($curl, CURLOPT_PROXYTYPE, CURLPROXY_HTTP);
curl_setopt($curl, CURLOPT_HTTPPROXYTUNNEL, true);
curl_setopt($curl, CURLOPT_PROXY, $proxy);
curl_setopt($curl, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
$data = curl_exec($curl);
$info = curl_getinfo($curl);
$error = curl_error($curl);

$all = array('data' => $data, 'info' => $info, 'error' => $error);
echo '<pre>';
print_r($all);
echo '</pre>';

This is what I get when I run the script:

Array
(
    [data] => HTTP/1.1 200 OK

HTTP/1.0 405 Method Not Allowed
Server: nginx
Content-Type: text/html; charset=UTF-8
Accept-Ranges: bytes
Expires: Thu, 01 Jan 1970 00:00:01 GMT
Cache-Control: private, no-cache, no-store, must-revalidate
Surrogate-Control: no-store, bypass-cache
Content-Length: 9411
X-EdgeConnect-MidMile-RTT: 203
X-EdgeConnect-Origin-MEX-Latency: 24
Date: Sat, 03 Nov 2018 17:15:56 GMT
Connection: close
Strict-Transport-Security: max-age=31536000; includeSubDomains

[info] => Array
        (
            [url] => https://www.stubhub.com/
            [content_type] => text/html; charset=UTF-8
            [http_code] => 405
            [header_size] => 487
            [request_size] => 608
            [filetime] => -1
            [ssl_verify_result] => 0
            [redirect_count] => 0
            [total_time] => 38.484
            [namelookup_time] => 0
            [connect_time] => 2.219
            [pretransfer_time] => 17.062
            [size_upload] => 0
            [size_download] => 9411
            [speed_download] => 244
            [speed_upload] => 0
            [download_content_length] => 9411
            [upload_content_length] => -1
            [starttransfer_time] => 23.859
            [redirect_time] => 0
            [redirect_url] => 
            [primary_ip] => 1.10.186.132
            [certinfo] => Array
                (
                )

            [primary_port] => 42150
            [local_ip] => 192.168.1.25
            [local_port] => 59320
        )

    [error] => 
)

And it comes with a reCAPTCHA:

Due to high volume of activity from your computer, our anti-robot software has blocked your access to stubhub.com. Please solve the puzzle below and you will immediately regain access.

When I open the site in any browser, it loads fine.

But with the script above, it does not.

So what am I missing to make the cURL request look like a real browser request?

Or, if there is an API or library that can do this, please mention it.

Would Guzzle or something similar solve this?

1 Answer:

Answer 0: (score: 0)

"So what am I missing to make the cURL request look like a real browser request?"

My guess is that they are using a simple cookie check. There are more advanced methods that can identify automation such as cURL with high reliability, especially when combined with lists of proxy IP addresses or IPs of known bad actors.
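
To illustrate the idea (a purely hypothetical sketch, not StubHub's actual logic, with an illustrative cookie name and token value): the first response carries a small JavaScript snippet that sets a token cookie and reloads, so a real browser passes through transparently, while cURL, which does not execute JavaScript, keeps arriving without the cookie and keeps getting challenged.

// Hypothetical server-side sketch of a simple cookie check.
// A client without the token gets a challenge page whose JavaScript
// sets the cookie and reloads; cURL never executes that script, so
// it is challenged on every visit.
if (empty($_COOKIE['entry_token'])) {
    http_response_code(405); // same status code the question's output shows
    echo '<script>
        document.cookie = "entry_token=abc123; path=/";
        location.reload();
    </script>';
    exit;
}
echo 'real content';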

Your first step is to intercept a request made by your browser, using pcap or similar, and then try to replicate that request with cURL.
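
For example, once you have the capture, you can replay the browser's header set verbatim from PHP. A minimal sketch; every header value below is a placeholder for whatever your own capture actually shows:

// Replay a browser request captured with pcap or the browser's dev tools.
// Copy the exact header set, values and order, from your own capture.
$ch = curl_init('https://www.stubhub.com/');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING       => '', // decompress gzip/deflate transparently
    CURLOPT_COOKIEFILE     => __DIR__ . '/cookies.txt',
    CURLOPT_COOKIEJAR      => __DIR__ . '/cookies.txt',
    CURLOPT_HTTPHEADER     => array(
        'User-Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: cs,en-US;q=0.7,en;q=0.3',
        'Upgrade-Insecure-Requests: 1',
    ),
));
$html = curl_exec($ch);
curl_close($ch);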

Another simple thing to check is whether your cookie jar has already been seeded. I check for that too, because most scripts out there are copy-pasted from the Internet and pay little attention to details like this.
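
Assuming the Netscape-format cookies.txt from the question, a quick way to see what the jar actually holds: make one request, close the handle (cookies are only flushed to disk at that point), then dump the file.

// Verify what actually landed in the cookie jar after a request.
// An empty jar after visiting a site known to set cookies is a red flag.
$jar = __DIR__ . '/cookies.txt';

$ch = curl_init('https://www.stubhub.com/');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEFILE     => $jar,
    CURLOPT_COOKIEJAR      => $jar,
));
curl_exec($ch);
curl_close($ch); // flushes the cookies to disk

echo is_file($jar) ? file_get_contents($jar) : "jar was never written\n";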

Something that would definitely get you bounced from any of my systems is that you are sending a Referer, yet you never actually seem to have connected to the front page. You are effectively saying "nice to see you again" to a server you are meeting for the first time. You may have saved a cookie from an earlier visit, and that cookie may since have been invalidated by other activity (effectively flagged as "evil"). At least while starting out, always replicate the access sequence from a clean slate.
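
A minimal sketch of such a clean-slate sequence (the inner URL is a hypothetical example): wipe the jar, load the front page first with no Referer, as a fresh browser would, and only then navigate deeper with the Referer pointing at the page you actually came from.

// Replicate a clean first visit: empty jar, front page first with no
// Referer, then an inner page with the Referer set to the real origin.
$jar = __DIR__ . '/cookies.txt';
@unlink($jar); // truly start from scratch

function fetch($url, $jar, $referer = null)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => $jar,
        CURLOPT_COOKIEJAR      => $jar,
    ));
    if ($referer !== null) {
        curl_setopt($ch, CURLOPT_REFERER, $referer);
    }
    $body = curl_exec($ch);
    curl_close($ch); // persist cookies between the steps
    return $body;
}

$front = fetch('https://www.stubhub.com/', $jar);   // first meeting, no Referer
sleep(rand(2, 5));                                  // a human does not click within milliseconds
$inner = fetch('https://www.stubhub.com/some-page', // hypothetical inner URL
               $jar, 'https://www.stubhub.com/');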

You can try adapting this answer, which is also based on cURL. Always verify the actual traffic with a MitM SSL-decoding proxy.
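
One way to do that, assuming a local mitmproxy instance on its default port 8080 with its CA certificate exported: route the script through the proxy, capture a browser session the same way, and compare the two flows side by side.

// Route the script through a local MitM proxy (e.g. mitmproxy on :8080)
// so its traffic can be inspected decrypted and compared with a browser's.
// The certificate path is an assumption; point it at wherever you exported
// the proxy's CA certificate.
$ch = curl_init('https://www.stubhub.com/');
curl_setopt_array($ch, array(
    CURLOPT_RETURNTRANSFER  => true,
    CURLOPT_PROXY           => '127.0.0.1:8080',
    CURLOPT_PROXYTYPE       => CURLPROXY_HTTP,
    CURLOPT_HTTPPROXYTUNNEL => true,
    CURLOPT_CAINFO          => __DIR__ . '/mitmproxy-ca-cert.pem',
));
curl_exec($ch);
curl_close($ch);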

Now for the real answer: what information do you actually need? Can you get it somewhere else? Can you ask for it explicitly, perhaps by reaching an agreement with the source site?