PHP curl返回403但不是shell命令

时间:2014-02-18 06:43:47

标签: php curl web-crawler

我正在尝试实现像facebook这样的功能,当你粘贴链接时,它会从页面中获取一些信息(h1,desc,images,...)并显示它们。

我已经遇到了几个我设法解决的问题(gzip,cookies,用户代理......)但是在这个问题上,我不确定是什么阻止了我的请求。

相关链接为http://www.mixcloud.com

这是我的PHP脚本:

protected function getContent()
{
    $ch = curl_init();
    $headers = array(
        'Accept: */*',
       // 'Accept-Encoding: gzip,deflate,sdch',
       // 'Accept-Language: en-US,en;q=0.8,es;q=0.6,fr;q=0.4,pt;q=0.2',
       // 'Cache-Control: no-cache',
       // 'Connection: keep-alive'
    );

    $debug = TRUE;

    // Set the request type
    curl_setopt($ch, CURLOPT_VERBOSE, $debug);
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, 'GET');
    curl_setopt($ch, CURLOPT_NOBODY, FALSE);
    curl_setopt($ch, CURLOPT_URL, $this->url);
    curl_setopt($ch, CURLOPT_USERAGENT, $this->userAgent);
    curl_setopt($ch, CURLOPT_REFERER, $this->referrer);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($ch, CURLOPT_HEADER, $debug);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_ENCODING , 'gzip');
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5');
    curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');
    curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');

    $data = curl_exec($ch);

    var_dump($data);die;

    return curl_exec($ch);
}

这是详细的回复:

* Adding handle: conn: 0x7f937504e400
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7f937504e400) send_pipe: 1, recv_pipe: 0
* About to connect() to www.mixcloud.com port 80 (#0)
*   Trying 46.23.65.210...
* Connected to www.mixcloud.com (46.23.65.210) port 80 (#0)
> GET / HTTP/1.1
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5
Host: www.mixcloud.com
Accept-Encoding: gzip
Referer: https://www.google.com.au
Accept: */*

< HTTP/1.1 403 Forbidden
* Server nginx/1.5.8 is not blacklisted
< Server: nginx/1.5.8
< Date: Tue, 18 Feb 2014 06:39:45 GMT
< Content-Type: text/html
< Transfer-Encoding: chunked
< Connection: keep-alive
< Vary: Accept-Encoding
< Content-Encoding: gzip
< 
* Connection #0 to host www.mixcloud.com left intact
string(376) "HTTP/1.1 403 Forbidden\r\nServer: nginx/1.5.8\r\nDate: Tue, 18 Feb 2014 06:39:45 GMT\r\nContent-Type: text/html\r\nTransfer-Encoding: chunked\r\nConnection: keep-alive\r\nVary: Accept-Encoding\r\nContent-Encoding: gzip\r\n\r\n<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor="white">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx/1.5.8</center>\r\n</body>\r\n</html>\r\n"

现在,如果我尝试在shell中执行curl命令,它可以正常工作:

$ curl -i 'http://www.mixcloud.com' -v
* Adding handle: conn: 0x7fe28b004000
* Adding handle: send: 0
* Adding handle: recv: 0
* Curl_addHandleToPipeline: length: 1
* - Conn 0 (0x7fe28b004000) send_pipe: 1, recv_pipe: 0
* About to connect() to www.mixcloud.com port 80 (#0)
*   Trying 46.23.65.210...
* Connected to www.mixcloud.com (46.23.65.210) port 80 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.30.0
> Host: www.mixcloud.com
> Accept: */*
> 
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Tue, 18 Feb 2014 06:41:30 GMT
Date: Tue, 18 Feb 2014 06:41:30 GMT
< Content-Type: text/html; charset=utf-8
Content-Type: text/html; charset=utf-8
< Content-Length: 194847
Content-Length: 194847
< Connection: keep-alive
Connection: keep-alive
< Vary: Accept-Encoding
Vary: Accept-Encoding
* Server gunicorn/0.17.4 is not blacklisted
< Server: gunicorn/0.17.4
Server: gunicorn/0.17.4
< Vary: Cookie, User-Agent, X-Requested-With, X-Ignore-Block
Vary: Cookie, User-Agent, X-Requested-With, X-Ignore-Block
< x-xss-protection: 1; mode=block
x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
x-content-type-options: nosniff
< Set-Cookie: csrftoken=ciOosbUNp5EL8t5tiQQzkoeaJIDJ3VfO; Domain=.mixcloud.com; expires=Tue, 17-Feb-2015 06:41:30 GMT; Max-Age=31449600; Path=/
Set-Cookie: csrftoken=ciOosbUNp5EL8t5tiQQzkoeaJIDJ3VfO; Domain=.mixcloud.com; expires=Tue, 17-Feb-2015 06:41:30 GMT; Max-Age=31449600; Path=/
< Set-Cookie: eventstream=; expires=Thu, 01-Jan-1970 00:00:00 GMT; Max-Age=0; Path=/
Set-Cookie: eventstream=; expires=Thu, 01-Jan-1970 00:00:00 GMT; Max-Age=0; Path=/

< 
<!DOCTYPE html> ...

我知道PHP和cURL的cURL是不同的,但我看不出我错过了什么。 任何人吗?

干杯, 马克西姆

1 个答案:

答案 0 :(得分:2)

好的,我发现了问题所在。这是用户代理。 这真的很奇怪。我正在使用这个用户代理:

Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 GTB5

有了这个用户代理,我得到了403.我已经使用以下内容更新了它:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36

它现在运作良好。我无法相信人们仍在拒绝特定用户代理的请求......