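I am trying to scrape this page with PHP cURL: http://www.newhorizonssc.com/localweb/catalog/coursecatalog.aspx?GroupId=402&keyword=infopath
However, when I run that URL through cURL and echo the result, I get the outline of the page, but the data table in the middle of the page is missing.
Result of going to the URL in a browser:
Result after running the URL through cURL:
However, when I look at the requested HTML page in Firebug, the response shows the same blank area that I am seeing (so if it matches that, my headers are probably fine?):
Obviously, I can't scrape the data from the table when it isn't showing up.
I have been trying all day, going through questions posted here, linked tutorials, and Google. It is clear that accessing data on an aspx form with PHP cURL is not the simplest thing, but nothing has worked.
At first I figured that, since I can append "?GroupId=402&keyword=infopath" to the end of the URL, a simple GET would do it. But since it doesn't, I assume there must be some kind of validation or something else going on.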
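For reference, the "simple GET" described above could be sketched roughly like this. The helper names buildCatalogUrl and fetchPage are made up for illustration and are not part of my actual function; the sketch only shows the plain request, which is the one that comes back without the table.

```php
<?php
// Hypothetical helper: build the catalog URL from its query parameters.
function buildCatalogUrl($base, array $params)
{
    // http_build_query() URL-encodes each key and value.
    return $base . '?' . http_build_query($params, '', '&');
}

// Hypothetical helper: plain GET; returns the body, or false on failure.
function fetchPage($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    return curl_exec($ch);
}

// Usage (performs a real network request):
// $url = buildCatalogUrl(
//     'http://www.newhorizonssc.com/localweb/catalog/coursecatalog.aspx',
//     array('GroupId' => 402, 'keyword' => 'infopath')
// );
// echo fetchPage($url);
```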
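I'm pretty sure I have all the right header information. But I noticed that the page that spits out good results makes 24 XHR requests, while my cURL page makes 0. I suppose I somehow need to make an AJAX call to bring up that table, but I'm lost as to how to do that. Also, if I do get the table to show, I will need to make a further AJAX call to simulate clicking the small plus button, which makes an AJAX call and shows the list of classes under each course.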
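A minimal sketch of what replaying one of those XHR requests with cURL might look like, under several assumptions: the endpoint has to be copied from one of the XHR entries in Firebug's Net panel (none is known here, so the usage comment uses a placeholder), the server may or may not check the X-Requested-With header, and the helper names buildXhrOptions and fetchXhr are invented for illustration.

```php
<?php
// Hypothetical helper: cURL options that mimic a browser XHR call.
function buildXhrOptions($endpoint, $referer, $cookieFile)
{
    return array(
        CURLOPT_URL            => $endpoint,
        CURLOPT_RETURNTRANSFER => true,
        // Send the page URL as the referer, as the browser would.
        CURLOPT_REFERER        => $referer,
        // Read and write the same cookie file so the session carries over.
        CURLOPT_COOKIEFILE     => $cookieFile,
        CURLOPT_COOKIEJAR      => $cookieFile,
        // Assumption: the server may look for this header on AJAX calls.
        CURLOPT_HTTPHEADER     => array('X-Requested-With: XMLHttpRequest'),
    );
}

// Hypothetical helper: run the request; returns the body or false.
function fetchXhr($endpoint, $referer, $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, buildXhrOptions($endpoint, $referer, $cookieFile));
    return curl_exec($ch);
}

// Usage (placeholder endpoint, to be replaced with a URL seen in Firebug):
// $body = fetchXhr('http://example.com/placeholder-endpoint',
//                  'http://www.newhorizonssc.com/localweb/catalog/coursecatalog.aspx?GroupId=402&keyword=infopath',
//                  'cookie.txt');
```

Whether the table data comes back from such a call depends on what those 24 requests actually are, which I have not worked out.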
Here is the entire cURL function I am using:
private function __curl($url) {
    $nameCourseSearch = 'ctl00$uxContentBody$txtSearch';
    $valCourseSearch = 'infopath';
    $nameSearchBtn = 'ctl00$uxContentBody$btnSearch';
    $valSearchBtn = 'GO';
    // the path to a file we can read/write; this will
    // store cookies we need for accessing secured pages
    $cookieFile = 'cookie.txt';
    // regular expressions to parse out the special ASP.NET
    // values for __VIEWSTATE and __EVENTVALIDATION; [^"]* stops
    // at the closing quote instead of matching greedily
    $regexViewstate = '/__VIEWSTATE" value="([^"]*)"/i';
    $regexEventVal = '/__EVENTVALIDATION" value="([^"]*)"/i';
    /************************************************
    * utility function: regexExtract
    * use the given regular expression to extract
    * a value from the given text; $regs will
    * be set to an array of all group values
    * (assuming a match) and the nthValue item
    * from the array is returned as a string
    ************************************************/
    // guard the declaration: a named function declared inside
    // a method is redeclared (fatal error) on a second call
    if (!function_exists('regexExtract')) {
        function regexExtract($text, $regex, $regs, $nthValue)
        {
            if (preg_match($regex, $text, $regs)) {
                return $regs[$nthValue];
            }
            return "";
        }
    }
    $ch = curl_init();
    /************************************************
    * first, issue a GET call to the ASP.NET
    * page. This is necessary to retrieve the
    * __VIEWSTATE and __EVENTVALIDATION values
    * that the server issues
    ************************************************/
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, FALSE);
    // enable cookie handling before the first request so any
    // session cookie set by this GET is sent with the later calls
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
    $data = curl_exec($ch);
    // from the returned html, parse out the __VIEWSTATE and
    // __EVENTVALIDATION values
    $regs = array();
    $viewstate = regexExtract($data, $regexViewstate, $regs, 1);
    $eventval = regexExtract($data, $regexEventVal, $regs, 1);
    // pass the raw values: http_build_query() URL-encodes them
    // later, so rawurlencode() here would encode them twice
    $postData = array(
        '__VIEWSTATE' => $viewstate,
        '__EVENTVALIDATION' => $eventval,
        'ctl00_ContentPlaceHolder1_tc1_ClientState' => '{"ActiveTabIndex":0,"TabState":[true,true]}',
        $nameCourseSearch => $valCourseSearch,
        $nameSearchBtn => $valSearchBtn,
    );
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/4");
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    //testing asp options
    curl_setopt($ch, CURLOPT_POST, TRUE);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($postData));
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
    $data = curl_exec($ch);
    /************************************************
    * with the authentication cookie in the jar,
    * we'll now issue a GET to the secured page;
    * we set curl's COOKIEFILE option to the same
    * file we used for the jar before to ensure the
    * authentication cookie is sent back to the
    * server
    ************************************************/
    curl_setopt($ch, CURLOPT_POST, FALSE);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
    $data = curl_exec($ch);
    //$result = curl_exec($ch);
    if (!$data) {
        echo "<br />cURL error number: " . curl_errno($ch);
        echo "<br />cURL error: " . curl_error($ch) . " on URL - " . $url;
        var_dump(curl_getinfo($ch));
        var_dump(curl_error($ch));
        exit;
    }
    return $data;
}
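As a possible alternative to the two regular expressions in the function above, the hidden ASP.NET fields could be read with PHP's DOMDocument, which also picks up any other hidden inputs the form posts back. This is only a sketch: the helper name extractHiddenFields is invented for illustration, and I have not confirmed that posting these fields is enough to make the table appear.

```php
<?php
// Hypothetical helper: collect every named hidden <input> as name => value.
function extractHiddenFields($html)
{
    $fields = array();
    if ($html === '' || $html === false) {
        return $fields; // nothing to parse
    }
    // Real-world HTML is rarely well-formed; collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '' && strtolower($input->getAttribute('type')) === 'hidden') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Usage: merge the hidden fields with the search fields, then let
// http_build_query() do the URL-encoding exactly once.
// $post = extractHiddenFields($data);
// $post['ctl00$uxContentBody$txtSearch'] = 'infopath';
// $post['ctl00$uxContentBody$btnSearch'] = 'GO';
// curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post));
```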
Sorry this is so long. I just wanted to make sure I posted all the information I thought was needed.
Thanks.