cURL returns different data than expected from an ASPX site

Date: 2015-04-20 07:11:33

Tags: php asp.net curl

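I am trying to fetch data from here, and I use the following code to do it. But it gives a different result from what we see in the browser, and I don't know why. Please help. Also, the log file and the cookie file are both empty. My code: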

<?php
function curl($url ,$binary=false,$post=false,$cookie =false ){
    touch($cookie);

$ch = curl_init();

        curl_setopt ($ch, CURLOPT_URL, $url );
        curl_setopt ($ch, CURLOPT_HEADER, 0);
        curl_setopt($ch, CURLINFO_HEADER_OUT, true);
        curl_setopt($ch, CURLOPT_REFERER, $url);
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);
        curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);



        if($cookie){

            $agent = "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)";
            curl_setopt($ch, CURLOPT_USERAGENT, $agent);
            curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
            curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);

        }


        if($binary)
            curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);

        if($post){
            curl_setopt($ch, CURLOPT_POST, true);
            curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
        }

     return  curl_exec ($ch);
     echo curl_getinfo($ch, CURLINFO_HEADER_OUT);
}
$dist=01; 
$assem=98;
$ok="Proceed";
$url="http://164.100.153.3/e-registration/booth_entry_report.aspx"; 
$cookie="cookie.txt";

$f = fopen('log.txt', 'w');
touch($cookie);
$useragent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/36.0.1985.125 Chrome/36.0.1985.125 Safari/537.36';


$ch = curl_init($url);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_USERAGENT, $useragent);

$html = curl_exec($ch);

curl_close($ch);

preg_match('~<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="(.*?)" />~', $html, $viewstate);
preg_match('~<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="(.*?)" />~', $html, $eventValidation);

$viewstate = $viewstate[1];
$eventValidation = $eventValidation[1];

$postdata = "ddldistrict=".$dist."&ddlassembly=".$assem."&btnproceed=".$ok."&VIEWSTATE=".$viewstate."&EVENTVALIDATION".$eventValidation; 

// function
$ch = curl($url,false,$postdata,$cookie);
//$url ='http://164.100.153.3/e-registration/booth_level_officer_report.aspx';
//$cookie="cookie.txt"; 

//$ch =curl($url,false,false,$cookie);

echo $ch;
?>

Actual result in the browser: [image]

Different result returned by cURL: [image]

1 answer:

Answer 0 (score: 0)

The page you are scraping expects more POST parameters (Firebug tells me so). The form also posts "__VIEWSTATE" and "__EVENTVALIDATION" (and a few more, so check those too).

Also, the form is submitted to the same page, "booth_entry_report.aspx", which then redirects (HTTP 302) to "booth_level_officer_report.aspx". So you don't need the second curl() call, because the first one already sets CURLOPT_FOLLOWLOCATION.

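There may also be a problem with the headers being sent. I suggest using Firebug, or better, Fiddler, to look at the request the browser sends and compare it with what PHP cURL sends.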

To see the headers PHP cURL sends, call curl_setopt($ch, CURLINFO_HEADER_OUT, true) before the request, then echo curl_getinfo($ch, CURLINFO_HEADER_OUT) after curl_exec().
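As a rough sketch of that debugging step (the function name `fetch_with_request_headers` is made up for illustration and is not part of the question's code), the outgoing headers could be captured as follows. Note that `curl_getinfo()` has to run after `curl_exec()` and before any `return`; in the question's function the `echo` sits after `return`, so it never executes.

```php
<?php
// Hypothetical helper, not from the original post: performs a GET and returns
// the response body together with the request headers cURL actually sent.
function fetch_with_request_headers(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // Ask cURL to keep a copy of the outgoing request headers.
    curl_setopt($ch, CURLINFO_HEADER_OUT, true);

    $body = curl_exec($ch);
    // Read the diagnostics only after curl_exec() has run.
    $sent = curl_getinfo($ch, CURLINFO_HEADER_OUT);
    $error = curl_error($ch);

    // The handle is released when $ch goes out of scope.
    return ['body' => $body, 'request_headers' => $sent, 'error' => $error];
}
```

Calling `fetch_with_request_headers($url)` and printing the `request_headers` entry gives text that can be compared line by line with the request shown in Fiddler.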

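Update:
Make sure the '__' prefix is in front of those POST parameter names (the question's code sends VIEWSTATE and EVENTVALIDATION without it, and the '=' after EVENTVALIDATION is missing).

Set ddldistrict and ddlassembly as strings ('01' and '98').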
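To illustrate why the string form matters: in PHP a bare `01` is an integer literal written in octal notation, so the leading zero is gone by the time the value is concatenated into the POST body. Whether the server actually rejects `1` is an assumption; the point is only that the script would not be sending the same `01` as the browser's dropdown.

```php
<?php
// An integer literal with a leading zero is parsed as octal, so 01 is simply 1.
$asInt = 01;
$asString = '01';

$fromInt = 'ddldistrict=' . $asInt;       // leading zero is lost
$fromString = 'ddldistrict=' . $asString; // leading zero is preserved

echo $fromInt, "\n";    // ddldistrict=1
echo $fromString, "\n"; // ddldistrict=01
```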

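I suggest reproducing every page load the browser performs on that page:

  1. GET the page and read the '__' parameters.
  2. Set the referer and POST to the same page with all the '__' parameters + ddldistrict='01', then read the '__' parameters again.
  3. Using those parameters, set the referer and POST to the same page again, with ddldistrict='01' + ddlassembly='98' + btnproceed='Proceed'.

Update 2 (code):
Here is the code for what I described above. I'm afraid I can't help you much further, though. That page really does look tricky. Good luck!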

    $dist='01'; 
    $assem='98';
    $ok="Proceed";
    $url="http://164.100.153.3/e-registration/booth_entry_report.aspx"; 
    $cookie="cookie.txt";
    
    /////// get the first page
    $ch = curl_init($url);
    // here curl_setopt() for url, cookie, useragent, followlocation, etc
    $html = curl_exec($ch);
    curl_close($ch);
    
    // get those variables with preg_match
    preg_match('... __VIEWSTATE ...', $html, $viewstate);
    $viewstate = $viewstate[1];
    
    preg_match('... __EVENTVALIDATION ...', $html, $eventValidation);
    $eventValidation = $eventValidation[1];
    
    // do the preg_match for '__EVENTTARGET', '__EVENTARGUMENT', '__LASTFOCUS'
    
    
    //////// do the next request
    // the first POST data, with 'ddlassembly=0' and without 'btnproceed', as when you select the "District name" in the browser
    $postdata = "ddldistrict=".$dist."&ddlassembly=0&__VIEWSTATE=".urlencode($viewstate)."&__EVENTVALIDATION=".urlencode($eventValidation); // add the rest of the __ fields too
    
    // post it with referer
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_REFERER, $url); // this with referer
    // here curl_setopt() for post data, url, cookie, useragent, followlocation, etc
    $html = curl_exec($ch);
    curl_close($ch);
    
    // here get all the '__' fields again with preg_match(), just like last time
    preg_match('... __VIEWSTATE ...', $html, $viewstate);
    $viewstate = $viewstate[1];
    
    preg_match('... __EVENTVALIDATION ...', $html, $eventValidation);
    $eventValidation = $eventValidation[1];
    
    // do the preg_match for '__EVENTTARGET', '__EVENTARGUMENT', '__LASTFOCUS'
    
    
    //////// do the last request
    // the second POST data, with 'ddlassembly=98' and also 'btnproceed', as when you click "Proceed" in the browser
    $postdata = "ddldistrict=".$dist."&ddlassembly=".$assem."&btnproceed=".$ok."&__VIEWSTATE=".urlencode($viewstate)."&__EVENTVALIDATION=".urlencode($eventValidation); // add the rest of the __ fields too
    
    // and finally post it with referer
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_REFERER, $url); // this with referer
    // here curl_setopt() for post data, url, cookie, useragent, followlocation, etc
    $html = curl_exec($ch);
    curl_close($ch);
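
Since the '__' fields have to be re-read after every response, it may be convenient to collect them all in one pass instead of writing one preg_match() per field. The following is only a sketch: `extract_aspnet_fields` is a hypothetical name, and the regular expressions assume double-quoted attributes, as in the markup the question's own patterns target.

```php
<?php
// Hypothetical helper (not part of the original answer): returns every hidden
// <input> whose name starts with "__" as a name => value array.
function extract_aspnet_fields(string $html): array
{
    $fields = [];
    // Grab each <input ...> tag, then inspect its attributes one by one so
    // the order of type/name/value inside the tag does not matter.
    preg_match_all('/<input\b[^>]*>/i', $html, $tags);
    foreach ($tags[0] as $tag) {
        if (!preg_match('/\stype="hidden"/i', $tag)) {
            continue; // not a hidden field
        }
        if (!preg_match('/\sname="(__[^"]*)"/', $tag, $name)) {
            continue; // name does not start with "__"
        }
        $value = preg_match('/\svalue="([^"]*)"/i', $tag, $m) ? $m[1] : '';
        // Attribute values are HTML-escaped in the page source.
        $fields[$name[1]] = html_entity_decode($value, ENT_QUOTES);
    }
    return $fields;
}
```

The returned array can then be extended with ddldistrict, ddlassembly and btnproceed before each POST, so that '__EVENTTARGET', '__EVENTARGUMENT' and '__LASTFOCUS' are carried along whenever the page contains them.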
    

A note on building the POST parameters: a better way is to create an associative array and pass it through http_build_query(), like this:

    $post_data = array(
        'ddldistrict' => '01',
        '__EVENTVALIDATION' => $eventValidation,
        '....' => '....'
    );
    
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($post_data));
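
One reason this helps: __VIEWSTATE and __EVENTVALIDATION are Base64 text, which can contain '+', '/' and '='. In a hand-concatenated body these go out unescaped, and a form decoder reads '+' as a space. The comparison below uses a short made-up value purely for illustration; real values are far longer.

```php
<?php
// Illustrative values only, not taken from the real page.
$fields = [
    'ddldistrict' => '01',
    '__VIEWSTATE' => '/wEPDw+abc==',
];

// Hand-built body: the special characters are sent as-is.
$manual = 'ddldistrict=' . $fields['ddldistrict']
    . '&__VIEWSTATE=' . $fields['__VIEWSTATE'];

// http_build_query() percent-encodes every value.
$encoded = http_build_query($fields);

echo $manual, "\n";  // ddldistrict=01&__VIEWSTATE=/wEPDw+abc==
echo $encoded, "\n"; // ddldistrict=01&__VIEWSTATE=%2FwEPDw%2Babc%3D%3D
```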