curl无法将数据发布到php页面

时间:2014-12-06 11:21:25

标签: php asp.net curl web-scraping libcurl

我正在尝试将值发布到表单以从aspx页面获取数据,但我无法获取数据

  1. 要打开的第一个网址: $url="https://www.clarkcountycourts.us/Anonymous/default.aspx"(此处设置了Cookie)

  2. 我们要搜索数据的第二个网址 $url3 ="https://www.clarkcountycourts.us/Anonymous/Search.aspx"

  3. 当我发布数据时,我只获得第3页的html而不是第2页的数据

    以下是我的代码,我错过了什么?请指导我。

    <?php 
    $url="https://www.clarkcountycourts.us/Anonymous/default.aspx"; 
    $cookie="cookie.txt";
    $url3 ="https://www.clarkcountycourts.us/Anonymous/Search.aspx";
    
    $ch = curl_init(); 
    curl_setopt ($ch, CURLOPT_URL, $url); 
    curl_setopt ($ch, CURLOPT_SSL_VERIFYPEER, FALSE); 
    curl_setopt ($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6"); 
    curl_setopt ($ch, CURLOPT_TIMEOUT, 60); 
    curl_setopt ($ch, CURLOPT_FOLLOWLOCATION, 1); 
    curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1); 
    curl_setopt ($ch, CURLOPT_COOKIEJAR, $cookie); 
    curl_setopt ($ch, CURLOPT_COOKIEFILE, $cookie);  // <-- add this line
    curl_setopt ($ch, CURLOPT_REFERER, $url); 
    $result = curl_exec ($ch); 
    
    
    
    $SearchBy = "2";
    $AttorneySearchMode="Name";
    $LastName= "Smith";
    $FirstName= "William";
    $MiddleName ="";
    $CaseStatusType= "0";
    $SortBy= "fileddate";
    $DateFiledOnAfter = "";
    $DateFiledOnBefore = "";
    $SearchSubmit ="Search";
    
    $fields = array(
        'SearchBy' => urlencode($SearchBy),
        'AttorneySearchMode' => urlencode($AttorneySearchMode),
        'LastName' => urlencode($LastName),
        'FirstName' => urlencode($FirstName),
        'MiddleName' => urlencode($MiddleName),
        'CaseStatusType' => urlencode($CaseStatusType),
        'SortBy' => urlencode($SortBy),
        'DateFiledOnAfter' => urlencode($DateFiledOnAfter),
        'DateFiledOnBefore' => urlencode($DateFiledOnBefore),
        'SearchSubmit' => urlencode($SearchSubmit)
    );
    
    $fields_string = "";
    foreach ($fields as $key=>$value) {
        $fields_string .= $key.'='.$value.'&';
    }
    $fields_string = rtrim($fields_string, '&');
    
    curl_setopt($ch, CURLOPT_URL, $url3);
    curl_setopt($ch, CURLOPT_HEADER, true );
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch,CURLOPT_POST, 1);
    curl_setopt($ch,CURLOPT_POSTFIELDS, $fields_string);
    $result2 = curl_exec ($ch); 
    print_r($result2);
    

3 个答案:

答案 0 :(得分:5)

这里有两件事情。

  1. 您没有正确使用cookie jar。当您想要存储从服务器设置的cookie时,使用CURLOPT_COOKIEJAR。 cURL不像这样的可编程浏览器。如果您想要建立会话(cookie),则第一个请求使用CURLOPT_COOKIEJAR,后续请求将使用CURLOPT_COOKIEFILE

    CURLOPT_COOKIEJAR个Cookie放入来自服务器的jar中 {cookie}将cookie文件从发送回服务器。

    在第一次连接服务器之后使用CURLOPT_COOKIEFILE,每次都会覆盖cookie jar。

  2. Cookie和cURL一直是我沮丧的源头,所以我可能会就这第一点做出更大的贡献。

    1. 当我在此服务器上手动执行搜索时,表单最终会发布到这样的网址:

      https://www.clarkcountycourts.us/Anonymous/Search.aspx?ID=200&NodeID=101,103,104,105,500,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,797,798&NodeDesc=All%20Courts

      此外,当我在浏览器中发布表单时,还会发送许多其他字段。

    2. 其中一些看起来很重要 - CURLOPT_COOKIEJAR__VIEWSTATE__EVENTVALIDATION

      __VIEWSTATEGENERATOR

      结论

      我相信您需要加载搜索页面以获取值以放入其中一些隐藏字段(解析页面),然后再次将搜索页面作为POST填充,并填入隐藏字段。

答案 1 :(得分:5)

花了一段时间来解决这个问题 - 但是下面的代码应该根据您对William Smith作为律师的原始查询在搜索后回显结果。希望这是有道理的......

            define('ROOT','c:/wwwroot');
            $url_base='https://www.clarkcountycourts.us';
            $url_login=$url_base.'/Anonymous/Login.aspx?ReturnUrl=/Anonymous/default.aspx';
            $url_start=$url_base.'/Anonymous/default.aspx';
            $url_search=$url_base.'/Anonymous/Search.aspx?ID=200&NodeID=101,103,104,105,500,600,601,602,603,604,605,606,607,608,609,610,611,612,613,614,615,616,617,618,619,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,797,798&NodeDesc=All%20Courts';
            $cookiejar=tempnam( sys_get_temp_dir(), 'cookiejar_' );

            $lastname='smith';
            $firstname='william';

            $formparams=array(
                '__VIEWSTATEGENERATOR',
                '__EVENTVALIDATION',
                '__EVENTTARGET',
                '__EVENTARGUMENT',
                '__VIEWSTATE'
            );
            $stdparams=array(
                'SearchBy' => '2',
                'AttorneySearchMode' => 'Name',
                'LastName' => $lastname,
                'FirstName' => $firstname,
                'MiddleName' => '',
                'CaseStatusType' => '0',
                'SortBy' => 'fileddate',
                'DateFiledOnAfter' => '',
                'DateFiledOnBefore' => '',                  
                'SearchSubmit' => 'Search',
                'CaseSearchMode' => 'CaseNumber',
                'CaseSearchValue' => '',
                'CitationSearchValue' => '',
                'CourtCaseSearchValue' => '',
                'PartySearchMode' => 'Name',
                'cboState' => 'AA',
                'DateOfBirth' => '',
                'DriverLicNum' => '',
                'chkCriminal' => 'on',
                'chkFamily' => 'on',
                'chkCivil' => 'on',
                'chkProbate' => 'on',
                'chkDtRangeCriminal' => 'on',
                'chkDtRangeFamily' => 'on',
                'chkDtRangeCivil' => 'on',
                'chkDtRangeProbate' => 'on',
                'chkCriminalMagist' => 'on',
                'chkFamilyMagist' => 'on',
                'chkCivilMagist' => 'on',
                'chkProbateMagist' => 'on',
                'DateSettingOnAfter' => '',
                'DateSettingOnBefore' => '',
                'SearchType' => 'PARTY',
                'SearchMode' => 'NAME',
                'NameTypeKy' => 'ALIAS',
                'BaseConnKy' => 'AT',
                'StatusType' => 'true',
                'ShowInactive' => '',
                'AllStatusTypes' => 'true',
                'CaseCategories' => '',
                'RequireFirstName' => 'False',
                'CaseTypeIDs' => '',
                'HearingTypeIDs' => ''
            );
            /* A fudge I know, manually build this param - I think it might be constructed using javascript on submission */
            $stdparams['SearchParams']='SearchBy~~Search+By:~~Attorney~~Attorney||AttorneyNameOption~~Party+Search+Mode:~~Name~~Name||LastName~~Last+Name:~~'.$lastname.'~~'.$lastname.'||FirstName~~First+Name:~~'.$firstname.'~~'.$firstname.'||AllOption~~Case+Status:~~0~~All||selectSortBy~~Sort+By:~~Filed+Date~~Filed+Date ';




            /* Initialise curl and set basic options */
            $curl=curl_init();
            if( parse_url( $url_base,PHP_URL_SCHEME )=='https' ){
                curl_setopt( $curl, CURLOPT_SSL_VERIFYHOST, 2 );
                curl_setopt( $curl, CURLOPT_SSL_VERIFYPEER, false );
                curl_setopt( $curl, CURLOPT_CAINFO, realpath( ROOT . '/cacert.pem' ) );
            }
            curl_setopt( $curl, CURLINFO_HEADER_OUT, true );
            curl_setopt( $curl, CURLOPT_RETURNTRANSFER, true );
            curl_setopt( $curl, CURLOPT_FOLLOWLOCATION, true );
            curl_setopt( $curl, CURLOPT_AUTOREFERER, true );
            curl_setopt( $curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:37.0) Gecko/20100101 Firefox/37.0' );
            curl_setopt( $curl, CURLOPT_HTTPHEADER, array('Host: '.parse_url( $url_login,PHP_URL_HOST ),'Connection: keep-alive') );


            /* Stage 1 - Get initial page to obtain session cookies etc and store in cookiejar...mmmmm, cookies! */
            curl_setopt( $curl, CURLOPT_URL, $url_login );
            curl_setopt( $curl, CURLOPT_COOKIEJAR, $cookiejar );
            curl_setopt( $curl, CURLOPT_COOKIESESSION, true );
            $res=curl_exec( $curl );



            /* Step 2 - GET form page and analyse input elements */
            curl_setopt( $curl, CURLOPT_URL, $url_search );
            curl_setopt( $curl, CURLOPT_AUTOREFERER, true );
            curl_setopt( $curl, CURLOPT_COOKIEFILE, $cookiejar );
            curl_setopt( $curl, CURLOPT_COOKIESESSION, false );
            $res=curl_exec( $curl );



            /* utility class that simplifies getting DOMDocument with error checking etc */
            $dom=new htmldom( $res );
            $html=$dom->gethtml();
            $col=$html->getElementsByTagName('input');

            /* we need to know values for specific fields - add to array of params to be submitted */
            foreach( $col as $index => $node ){
                if( in_array( $node->getAttribute('name'), $formparams ) ) {
                    $stdparams[ $node->getAttribute('name') ]=$node->getAttribute('value');
                }
            }
            /* Not sure how the 'SearchParams' field value is calculated so calculated manually above..... */




            /* Stage 3 - POST */
            /* Prepare search query */
            $info=$res='';
            $querystring=http_build_query( $stdparams, '', '&' );
            curl_setopt( $curl, CURLOPT_URL, $url_search );
            curl_setopt( $curl, CURLOPT_REFERER, $url_search );
            curl_setopt( $curl, CURLOPT_COOKIEFILE, $cookiejar );
            curl_setopt( $curl, CURLOPT_COOKIESESSION, false );
            curl_setopt( $curl, CURLOPT_POST, true );
            curl_setopt( $curl, CURLOPT_POSTFIELDS, $querystring );
            $res=curl_exec( $curl );
            $info=curl_getinfo( $curl );


            /* The final search results - you could manipulate the dom to get specific items if you wished */
            echo '<pre>';
            print_r( $res );
            echo '</pre>';

            /* Close curl */
            curl_close( $curl );

为完整起见,上面提到的实用程序类。

class htmldom{
    private $html;
    public function __construct( $data=false, $convert=true ){
        try{
            if( !$data ) return false;
            libxml_use_internal_errors( true );
            $this->html = new DOMDocument('1.0','utf-8');
            $this->html->validateOnParse=false;
            $this->html->standalone=true;
            $this->html->preserveWhiteSpace=true;
            $this->html->strictErrorChecking=false;
            $this->html->substituteEntities=false;
            $this->html->recover=true;
            $this->html->formatOutput=false;

            $this->html->loadHTML( $convert ? mb_convert_encoding( $data, 'utf-8' ) : $data );

            $parse_errs=serialize( libxml_get_last_error() );
            libxml_clear_errors();

        }catch( Exception $e ){
            die( $e->getMessage() );    
        }
    }
    public function gethtml(){
        return $this->html;
    }
}

答案 2 :(得分:0)

也许您在初始POST后被重定向?

尝试以下方法:

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);

这应该忽略任何重定向。

可能有助于您调试的其他CURL选项:

CURLOPT_MAXREDIRS
CURLOPT_POSTREDIR

参考:

http://php.net/manual/en/function.curl-setopt.php