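When I use curl to fetch a page from an e-commerce site, it always returns the same first page (the starting-item parameter is ignored), whereas visiting the same URL in a browser works as expected.
Simplified code: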
// s is the starting item count, no idea what yp4p_page is for exactly yet.
$url = 'http://list.taobao.com/market/baobao.htm?cat=40&yp4p_page=4&s=176';
$ch = curl_init($url);
$header[0] = 'Accept: text/xml,application/xml,application/xhtml+xml,'
. 'text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$header[] = 'Cache-Control: max-age=0';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';
//$cookieFile = tempnam('/tmp', 'curlcookie');
$cookieFile = dirname(__FILE__) . DIRECTORY_SEPARATOR . 'curlcookies.txt';
$options = array(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HEADER => false,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_ENCODING => 'gzip,deflate',
CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0 FirePHP/0.6',
CURLOPT_AUTOREFERER => true,
CURLOPT_CONNECTTIMEOUT => 120,
CURLOPT_TIMEOUT => 120,
CURLOPT_MAXREDIRS => 10,
CURLOPT_SSL_VERIFYHOST => 0,
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_VERBOSE => 1,
CURLOPT_HTTPHEADER => $header,
CURLOPT_COOKIEFILE => $cookieFile,
CURLOPT_COOKIEJAR => $cookieFile,
);
curl_setopt_array($ch, $options);
$strPageHTML = curl_exec($ch);
curl_close($ch);
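Sorry about the Chinese website, but if you look at the items listed in the HTML that curl returns, their IDs are always the same as those on the first page (where s = 0) rather than different.
What am I doing wrong?
Edit 1: Added cookies to the code; it still doesn't work.
Edit 2: Edited the cookie lines to clear up any confusion. The cookie file contains: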
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
#HttpOnly_.taobao.com TRUE / FALSE 0 cookie2 d686d4be95b4b56b61292118b43e1333
#HttpOnly_.taobao.com TRUE / FALSE 1316321978 _tb_token_ eeab7e3e5ea9e
.taobao.com TRUE / FALSE 1321505978 t 3c473872e51e93b0cf172375b31f503a
.taobao.com TRUE / FALSE 0 uc1 cookie14=UoLdHCGrCsSKAg%3D%3D
.taobao.com TRUE / FALSE 0 v 0
.taobao.com TRUE / FALSE 0 _lang zh_CN:GBK
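To check which cookies curl actually stored, a cookie jar like the one above can be read back programmatically. The following is only a sketch: `parseCookieJar` is a hypothetical helper name, and it assumes the seven whitespace-separated fields seen in the listing (domain, subdomain flag, path, secure flag, expiry, name, value) and that cookie values contain no spaces.

```php
<?php
// Hypothetical helper (not part of cURL): parses the text of a Netscape-format
// cookie file into an array keyed by cookie name.
function parseCookieJar(string $contents): array
{
    $cookies = array();
    foreach (preg_split('/\R/', $contents) as $line) {
        $line = trim($line);
        $httpOnly = false;
        if (strpos($line, '#HttpOnly_') === 0) {
            // libcurl marks HttpOnly cookies with this prefix; it is not a comment.
            $httpOnly = true;
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line === '' || $line[0] === '#') {
            continue; // blank line or ordinary comment
        }
        // Assumption: fields are separated by whitespace and the value has no spaces.
        $fields = preg_split('/\s+/', $line, 7);
        if (count($fields) !== 7) {
            continue; // not a cookie line in the expected shape
        }
        list($domain, , $path, , $expires, $name, $value) = $fields;
        $cookies[$name] = array(
            'domain'   => $domain,
            'path'     => $path,
            'value'    => $value,
            'httpOnly' => $httpOnly,
            'session'  => ((int) $expires === 0), // expiry 0 means a session cookie
        );
    }
    return $cookies;
}
```

Calling `parseCookieJar(file_get_contents($cookieFile))` after `curl_close($ch)` shows whether the jar was written at all and which entries are session cookies.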
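Answer 0 (score: 3)
You should look at the cookies the site generates, and possibly at CSRF tokens it may insert to stop you from scraping. When I inspect the page on first load, I find: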
Set-Cookie:cookie2=b1d92ddac8aa82350a6ff5e892a8637d;Domain=.taobao.com;Path=/;HttpOnly
_tb_token_=fde3979ee6b13;Domain=.taobao.com;Path=/;Expires=Sat, 17-Sep-2011 07:09:40 GMT;HttpOnly
t=91f29eb410a21a04bf36025823c4b2ad; Domain=.taobao.com; Expires=Wed, 16-Nov-2011 07:09:40 GMT; Path=/
uc1=cookie14=UoLdHCDBHbn1eg%3D%3D; Domain=.taobao.com; Path=/
Perhaps these cookies are used to identify you while you browse categories.
Searching the DOM for "token" also yields some results.
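To compare what curl receives with what the browser shows, the response headers can be captured alongside the body. The sketch below uses `CURLOPT_HEADERFUNCTION` for that; the helper names `extractSetCookies` and `fetchWithHeaders` are made up for illustration, and nothing here is confirmed to change what the site returns.

```php
<?php
// Hypothetical helper: collects cookie name => value pairs from response header lines.
function extractSetCookies(array $headerLines): array
{
    $cookies = array();
    foreach ($headerLines as $line) {
        // Header names are case-insensitive.
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue;
        }
        // Keep the leading "name=value" pair; drop attributes such as Domain or Path.
        $pair = trim(substr($line, strlen('Set-Cookie:')));
        $pair = explode(';', $pair, 2)[0];
        $parts = explode('=', $pair, 2);
        if (count($parts) === 2) {
            $cookies[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $cookies;
}

// Hypothetical helper: performs one GET and returns array(body, header lines).
function fetchWithHeaders(string $url): array
{
    $lines = array();
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $line) use (&$lines) {
        $lines[] = rtrim($line);
        return strlen($line); // cURL expects the number of bytes handled
    });
    $body = curl_exec($ch);
    curl_close($ch);
    return array($body, $lines);
}
```

With these in place, `list($html, $headers) = fetchWithHeaders($url); print_r(extractSetCookies($headers));` lists the cookies the server tried to set on that request.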
Answer 1 (score: 2)
Could you get the information you need through the API (http://open.taobao.com/) instead of pretending to be a user visiting the page?
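Answer 2 (score: 1)
This page uses a lot of cookies, so I wouldn't be surprised if a session cookie is required to load it. See what happens when you enable cookie handling (here $DATA_POST is the cURL handle):
curl_setopt($DATA_POST, CURLOPT_COOKIEFILE, 'cookiefile.txt');
curl_setopt($DATA_POST, CURLOPT_COOKIEJAR, 'cookiefile.txt');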
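If a session cookie really is required, one thing to try is requesting a first page and then the paged URL on the same handle, so cookies set by the first response are sent with the second request. This is a sketch under that assumption, not a confirmed fix; `sessionCurlOptions` and `fetchInSession` are hypothetical helper names.

```php
<?php
// Hypothetical helper: options shared by every request in one "session".
function sessionCurlOptions(string $cookieFile): array
{
    return array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from this file
        CURLOPT_COOKIEJAR      => $cookieFile, // and written back when the handle is closed
    );
}

// Hypothetical helper: fetches the URLs in order on one handle, so cookies
// received from earlier responses accompany the later requests.
function fetchInSession(array $urls, string $cookieFile): array
{
    $ch = curl_init();
    curl_setopt_array($ch, sessionCurlOptions($cookieFile));
    $pages = array();
    foreach ($urls as $url) {
        curl_setopt($ch, CURLOPT_URL, $url);
        $pages[$url] = curl_exec($ch); // false on failure
    }
    curl_close($ch);
    return $pages;
}
```

For example, passing the category URL without the `s` parameter first (an assumed entry point) and then the URL with `s=176` makes it possible to compare the two bodies and see whether the second one differs.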
Answer 3 (score: 1)
// s is the starting item count, no idea what yp4p_page is for exactly yet.
$url = 'http://list.taobao.com/market/baobao.htm?cat=40&yp4p_page=4&s=176';
$ch = curl_init($url);
$header[0] = 'Accept: text/xml,application/xml,application/xhtml+xml,'
. 'text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$header[] = 'Cache-Control: max-age=0';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';
$cookieFile = "cookie_china"; // I've changed this value and it seems to be working fine, I get the same results
$options = array(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HEADER => false,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_ENCODING => 'gzip,deflate',
CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0 FirePHP/0.6',
CURLOPT_AUTOREFERER => true,
CURLOPT_CONNECTTIMEOUT => 120,
CURLOPT_TIMEOUT => 120,
CURLOPT_MAXREDIRS => 10,
CURLOPT_SSL_VERIFYHOST => 0,
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_VERBOSE => 1,
CURLOPT_HTTPHEADER => $header,
CURLOPT_COOKIEFILE => $cookieFile,
CURLOPT_COOKIEJAR => $cookieFile,
);
curl_setopt_array($ch, $options);
$strPageHTML = curl_exec($ch);
curl_close($ch);