curl does not get the expected page content. Possible causes?

Asked: 2011-08-18 07:05:59

Tags: php curl

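When I use curl to fetch a page from an e-commerce site, it always gives me the same first page (the starting-item parameter is ignored), whereas visiting the URL in a browser works as usual.

Simplified code: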

// s is the starting item count, no idea what yp4p_page is for exactly yet.
$url = 'http://list.taobao.com/market/baobao.htm?cat=40&yp4p_page=4&s=176';

$ch = curl_init($url);

$header[0] = 'Accept: text/xml,application/xml,application/xhtml+xml,'
                . 'text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$header[] = 'Cache-Control: max-age=0';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';

//$cookieFile = tempnam('/tmp', 'curlcookie');
$cookieFile = dirname(__FILE__) . DIRECTORY_SEPARATOR . 'curlcookies.txt';

$options = array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HEADER => false,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_ENCODING => 'gzip,deflate',
            CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0 FirePHP/0.6',
            CURLOPT_AUTOREFERER => true,
            CURLOPT_CONNECTTIMEOUT => 120,
            CURLOPT_TIMEOUT => 120, 
            CURLOPT_MAXREDIRS => 10, 
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => false, 
            CURLOPT_VERBOSE => 1,
            CURLOPT_HTTPHEADER => $header,
            CURLOPT_COOKIEFILE => $cookieFile,
            CURLOPT_COOKIEJAR => $cookieFile,
);

curl_setopt_array($ch, $options);

$strPageHTML = curl_exec($ch);

curl_close($ch);

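Sorry that the site is in Chinese, but if you look at the listed items and their URLs in what curl returns, the item IDs are always the same as the ones on the first page (where s = 0).

What am I doing wrong?

Edit 1: Added cookies to the code; it still does not work.

Edit 2: Edited the cookie lines to clear up any confusion. The cookie file contains: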

# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.

#HttpOnly_.taobao.com   TRUE    /   FALSE   0   cookie2 d686d4be95b4b56b61292118b43e1333
#HttpOnly_.taobao.com   TRUE    /   FALSE   1316321978  _tb_token_  eeab7e3e5ea9e
.taobao.com TRUE    /   FALSE   1321505978  t   3c473872e51e93b0cf172375b31f503a
.taobao.com TRUE    /   FALSE   0   uc1 cookie14=UoLdHCGrCsSKAg%3D%3D
.taobao.com TRUE    /   FALSE   0   v   0
.taobao.com TRUE    /   FALSE   0   _lang   zh_CN:GBK
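
To confirm which cookies curl actually stored, the jar can be read back in PHP. The following is a minimal sketch, assuming the file uses the Netscape layout libcurl writes (domain, subdomain flag, path, secure flag, expiry, name, value); `parseCookieJar` is a hypothetical helper name, not part of the curl extension. Note that lines starting with `#HttpOnly_` are real cookies, not comments.

```php
<?php
/**
 * Parse the text of a Netscape-format cookie jar into a list of arrays.
 * Hypothetical helper for inspection only.
 */
function parseCookieJar(string $text): array
{
    $cookies = array();
    foreach (preg_split('/\R/', $text) as $line) {
        $line = trim($line);
        $httpOnly = false;
        // libcurl prefixes HttpOnly cookies with "#HttpOnly_"; strip it and keep the cookie.
        if (strpos($line, '#HttpOnly_') === 0) {
            $httpOnly = true;
            $line = substr($line, strlen('#HttpOnly_'));
        }
        // Skip blank lines and genuine comment lines.
        if ($line === '' || $line[0] === '#') {
            continue;
        }
        // Fields are normally tab-separated; splitting on any whitespace run
        // also copes with pasted copies where tabs became spaces.
        $fields = preg_split('/\s+/', $line, 7);
        if (count($fields) < 6) {
            continue;
        }
        $cookies[] = array(
            'domain'   => $fields[0],
            'path'     => $fields[2],
            'secure'   => $fields[3] === 'TRUE',
            'expires'  => (int) $fields[4], // 0 means a session cookie
            'name'     => $fields[5],
            'value'    => isset($fields[6]) ? $fields[6] : '',
            'httpOnly' => $httpOnly,
        );
    }
    return $cookies;
}
```

Calling `parseCookieJar(file_get_contents($cookieFile))` on the file shown above should list six cookies, two of them flagged as HttpOnly.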

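4 Answers:

Answer 0: (score: 3)

You should look at the cookies the site generates, and possibly at CSRF tokens it may insert to stop you from scraping it. When I inspect the page on first load, I find: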

Set-Cookie:cookie2=b1d92ddac8aa82350a6ff5e892a8637d;Domain=.taobao.com;Path=/;HttpOnly
_tb_token_=fde3979ee6b13;Domain=.taobao.com;Path=/;Expires=Sat, 17-Sep-2011 07:09:40 GMT;HttpOnly
t=91f29eb410a21a04bf36025823c4b2ad; Domain=.taobao.com; Expires=Wed, 16-Nov-2011 07:09:40 GMT; Path=/
uc1=cookie14=UoLdHCDBHbn1eg%3D%3D; Domain=.taobao.com; Path=/

Perhaps these cookies are used to identify you while you browse the categories.

Searching the DOM for "token" also turns up a few results.
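
Both checks can be done roughly from PHP. This is only a sketch: `extractSetCookieNames` and `findTokenFields` are hypothetical helper names, the regex-based HTML scan is deliberately crude, and whether the site really embeds a token in a form field is an assumption.

```php
<?php
/**
 * Return the names of all cookies set in a raw HTTP response header block.
 */
function extractSetCookieNames(string $rawHeaders): array
{
    $names = array();
    foreach (preg_split('/\r?\n/', $rawHeaders) as $line) {
        // A Set-Cookie header starts with "name=value"; capture the name only.
        if (preg_match('/^Set-Cookie:\s*([^=;\s]+)=/i', $line, $m)) {
            $names[] = $m[1];
        }
    }
    return $names;
}

/**
 * Find <input> fields whose name contains "token" (case-insensitive).
 * Returns an array of name => value. Crude regex scan, not a full HTML parser.
 */
function findTokenFields(string $html): array
{
    $found = array();
    preg_match_all('/<input\b[^>]*>/i', $html, $tags);
    foreach ($tags[0] as $tag) {
        if (preg_match('/\bname\s*=\s*["\']([^"\']*)["\']/i', $tag, $n)
            && stripos($n[1], 'token') !== false) {
            $hasValue = preg_match('/\bvalue\s*=\s*["\']([^"\']*)["\']/i', $tag, $v);
            $found[$n[1]] = $hasValue ? $v[1] : '';
        }
    }
    return $found;
}
```

To get the raw headers, one option is to set `CURLOPT_HEADER` to `true` and split the response at `curl_getinfo($ch, CURLINFO_HEADER_SIZE)`; the first part goes to `extractSetCookieNames` and the rest to `findTokenFields`.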

Answer 1: (score: 2)

Could you get the information you need through the API (http://open.taobao.com/) instead of pretending to be a user visiting the page?

Answer 2: (score: 1)

This page uses a lot of cookies, and I would not be surprised if loading it requires a session cookie. See what happens when you enable:

curl_setopt($DATA_POST, CURLOPT_COOKIEFILE, 'cookiefile.txt'); 
curl_setopt($DATA_POST, CURLOPT_COOKIEJAR, 'cookiefile.txt');
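
If a session cookie is indeed required, the cookies have to exist before the paginated URL is requested. A minimal sketch of that idea follows: it requests an entry page first and then the target page on the same handle, so cookies from the first response are sent with the second request. `sessionCurlOptions` and `fetchWithSession` are hypothetical names, and which entry URL the site expects is an assumption that has not been verified.

```php
<?php
/**
 * Options that switch on curl's cookie engine and persist cookies to a file.
 */
function sessionCurlOptions(string $cookieFile): array
{
    return array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here; also enables the cookie engine
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is released
    );
}

/**
 * Request $entryUrl to pick up session cookies, then request $targetUrl
 * on the same handle. Returns the target page body, or false on failure.
 */
function fetchWithSession(string $entryUrl, string $targetUrl, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, sessionCurlOptions($cookieFile));

    // First request: let the server set its cookies.
    curl_setopt($ch, CURLOPT_URL, $entryUrl);
    if (curl_exec($ch) === false) {
        curl_close($ch);
        return false;
    }

    // Second request: same handle, so the cookies just received are sent back.
    curl_setopt($ch, CURLOPT_URL, $targetUrl);
    curl_setopt($ch, CURLOPT_REFERER, $entryUrl);
    $html = curl_exec($ch);

    curl_close($ch);
    return $html;
}
```

Reusing one handle keeps the cookies in memory between the two requests, so the result does not depend on when the jar file is flushed to disk.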

Answer 3: (score: 1)

// s is the starting item count, no idea what yp4p_page is for exactly yet.
$url = 'http://list.taobao.com/market/baobao.htm?cat=40&yp4p_page=4&s=176';

$ch = curl_init($url);

$header[0] = 'Accept: text/xml,application/xml,application/xhtml+xml,'
                . 'text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$header[] = 'Cache-Control: max-age=0';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';

$cookieFile = "cookie_china"; // I've changed this value and it seems to be working fine, I get the same results

$options = array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HEADER => false,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_ENCODING => 'gzip,deflate',
            CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0 FirePHP/0.6',
            CURLOPT_AUTOREFERER => true,
            CURLOPT_CONNECTTIMEOUT => 120,
            CURLOPT_TIMEOUT => 120, 
            CURLOPT_MAXREDIRS => 10, 
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => false, 
            CURLOPT_VERBOSE => 1,
            CURLOPT_HTTPHEADER => $header,
            CURLOPT_COOKIEFILE => $cookieFile,
            CURLOPT_COOKIEJAR => $cookieFile,
);

curl_setopt_array($ch, $options);

$strPageHTML = curl_exec($ch);

curl_close($ch);
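
Because the code above follows redirects, one thing worth ruling out (it is only a possibility, not a confirmed cause) is that a redirect drops the `s` parameter along the way. A small sketch for comparing the requested URL with the one curl ended up at; `droppedQueryParams` is a hypothetical helper name.

```php
<?php
/**
 * Return the query parameters of $requestedUrl that are missing from,
 * or have a different value in, $effectiveUrl.
 */
function droppedQueryParams(string $requestedUrl, string $effectiveUrl): array
{
    // parse_url() returns null when there is no query string; cast to '' for parse_str().
    parse_str((string) parse_url($requestedUrl, PHP_URL_QUERY), $requested);
    parse_str((string) parse_url($effectiveUrl, PHP_URL_QUERY), $effective);

    $dropped = array();
    foreach ($requested as $key => $value) {
        if (!array_key_exists($key, $effective) || $effective[$key] !== $value) {
            $dropped[$key] = $value;
        }
    }
    return $dropped;
}
```

To use it, read `curl_getinfo($ch, CURLINFO_EFFECTIVE_URL)` after `curl_exec($ch)` and before `curl_close($ch)`, then pass it together with `$url`. An empty result means every requested parameter is still present in the final URL.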