Screen scraping an aspx page with curl

Time: 2012-10-06 09:42:34

Tags: php asp.net curl screen-scraping

I am using this code, but it is not working. Please help.

$url = "http://www.riogrande.com/Category/Findings-and-Finished-Jewelry/132/Bails-and-Enhancers/472";
$file=file_get_contents($url);
preg_match("#.*?#mis", $file, $arr_viewstate);
$viewstate = urlencode($arr_viewstate[1]);
$eventvalidation = urlencode($arr_viewstate[2]);
$options = array(
    CURLOPT_RETURNTRANSFER => true, // return web page
    CURLOPT_HEADER => false, // don't return headers
    CURLOPT_ENCODING => "", // handle all encodings
    CURLOPT_USERAGENT => "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7'", // who am i
    CURLOPT_AUTOREFERER => true, // set referer on redirect
    CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
    CURLOPT_TIMEOUT => 1120, // timeout on response
    CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
    CURLOPT_POST => true,
    CURLOPT_VERBOSE => true,
    CURLOPT_POSTFIELDS => '__EVENTTARGET='.urlencode('ctl00$ContentPlaceHolderBody$SearchPageNavigationTop$rptPager$ctl01').'&__EVENTARGUMENT='.urlencode('').'&__VIEWSTATE='.$viewstate.'&__EVENTVALIDATION='.$eventvalidation.'&__LASTFOCUS='.urlencode('')
);

$ch = curl_init($url);
curl_setopt_array($ch,$options);

2 answers:

Answer 0 (score: 2)

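To be honest, I don't understand what you are trying to achieve, but I do know that this is not the way to get __VIEWSTATE and __EVENTVALIDATION.

It should look like this: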

$url = "http://www.riogrande.com/Category/Findings-and-Finished-Jewelry/132/Bails-and-Enhancers/472";
$html = file_get_contents($url);

preg_match('~<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="(.*?)" />~',$html,$viewstate);
preg_match('~<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="(.*?)" />~',$html,$eventvalidation);

$viewstate = $viewstate[1];
$eventvalidation = $eventvalidation[1] ;

var_dump($viewstate,$eventvalidation);
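
The regular expressions above depend on the exact attribute order and spacing of the hidden inputs. As a rough sketch of a more tolerant alternative, the page can be parsed with DOMDocument and the POST body built with http_build_query(). The helper names extract_hidden_fields() and build_postback_body() are made up for illustration, and the sketch assumes the page carries its state in ordinary hidden input elements:

```php
<?php
// Hypothetical helper: collect every hidden <input> on the page as name => value.
function extract_hidden_fields(string $html): array
{
    if ($html === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '' && strtolower($input->getAttribute('type')) === 'hidden') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Hypothetical helper: build the POST body for a postback triggered by $eventTarget.
function build_postback_body(array $hiddenFields, string $eventTarget): string
{
    $hiddenFields['__EVENTTARGET'] = $eventTarget;
    $hiddenFields['__EVENTARGUMENT'] = '';
    // http_build_query() URL-encodes every value, so no manual urlencode() is needed.
    return http_build_query($hiddenFields, '', '&');
}
```

With these in place, something like build_postback_body(extract_hidden_fields($html), 'ctl00$ContentPlaceHolderBody$SearchPageNavigationTop$rptPager$ctl01') could be passed as CURLOPT_POSTFIELDS in the question's $options array. Whether the server accepts the postback still depends on the site.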

Answer 1 (score: 0)

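This code seems to work... I pasted it into a blank PHP file and got the content of the target URL back. However, the images are broken, the stylesheets are not loaded, and the JavaScript does not work.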

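That is the problem with scraping an entire web page: pages that reference images, CSS, JavaScript and so on through relative URLs will not work as expected once you serve the markup from your own server.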

If you insist on scraping the page and echoing out the result, try replacing the last few lines of your code with:

$result = curl_exec($ch); 
curl_close($ch);

$result = str_replace("../../../../","http://www.riogrande.com/",$result);
echo $result;

I just noticed that the relative URLs start with ../../../../, so turning them into absolute URLs may help the images load correctly.
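
The str_replace() only catches URLs that start with exactly ../../../../. A different technique, shown here only as a sketch, is to inject a base element into the page head so the browser resolves every relative URL against the original address. The helper name add_base_href() is made up for illustration, and the sketch assumes the page has a head tag:

```php
<?php
// Hypothetical helper: insert <base href="..."> right after the opening <head> tag.
function add_base_href(string $html, string $baseUrl): string
{
    // Leave the page alone if it already declares a base URL.
    if (preg_match('~<base\b~i', $html)) {
        return $html;
    }
    $tag = '<base href="' . htmlspecialchars($baseUrl, ENT_QUOTES) . '">';
    // Touch only the first <head ...> tag; pages without one are returned unchanged.
    $patched = preg_replace_callback(
        '~<head\b[^>]*>~i',
        function (array $m) use ($tag) {
            return $m[0] . $tag;
        },
        $html,
        1
    );
    return $patched ?? $html;
}
```

Usage would be echo add_base_href($result, $url); in place of the str_replace() line. With the page's own URL as the base, ../../../../ should resolve to the site root, which is what the replacement above hard-codes.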