Loading and parsing an HTML string

Asked: 2015-07-26 09:14:59

Tags: php html

I get the following warnings when I try to parse Google's search results:

$html = file_get_contents('http://www.google.dk/search?q='.urlencode($query).'&start=0&num=100', false, $context);

$doc = new DOMDocument();
$doc->loadHTML($html);

Errors:

PHP Warning:  DOMDocument::loadHTML(): Input is not proper UTF-8, indicate encoding ! in Entity, line: 1 in /var/www/dynaccount.com/class/Cronjob_check_serp_position.php on line 132

Warning: DOMDocument::loadHTML(): Input is not proper UTF-8, indicate encoding ! in Entity, line: 1 in /var/www/dynaccount.com/class/Cronjob_check_serp_position.php on line 132
PHP Warning:  DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 1 in /var/www/dynaccount.com/class/Cronjob_check_serp_position.php on line 132

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 1 in /var/www/dynaccount.com/class/Cronjob_check_serp_position.php on line 132
PHP Warning:  DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 1 in /var/www/dynaccount.com/class/Cronjob_check_serp_position.php on line 132

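1 answer:

Answer 0 (score: 1)

libxml has built-in error handling that helps here. Calling libxml_use_internal_errors(true) before loadHTML() makes libxml record parse errors internally instead of raising PHP warnings, so they can be inspected and cleared afterwards: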

$query = 'php rocks';

// Fetch the results page.
$data = file_get_contents('http://www.google.co.uk/search?q=' . urlencode($query) . '&start=0&num=100');

// Collect libxml errors internally instead of emitting PHP warnings.
libxml_use_internal_errors(true);

$html = new DOMDocument('1.0', 'utf-8');
$html->validateOnParse = false;
$html->standalone = true;
$html->preserveWhiteSpace = true;
$html->strictErrorChecking = false;
$html->substituteEntities = false;
$html->recover = true;
$html->formatOutput = true;
$html->loadHTML($data);

// Keep the last parse error for later inspection, then empty the error buffer.
$parse_errs = serialize(libxml_get_last_error());
libxml_clear_errors();

// Look up the result links inside the element with id "ires".
$xpath = new DOMXPath($html);
$div = $html->getElementById('ires');
if ($div !== null) {
    $col = $xpath->query('ol/li/h3/a', $div);
    foreach ($col as $node) {
        echo $node->getAttribute('href') . '<br />';
    }
}

// Release the DOM objects.
$html = null;
$xpath = null;
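
The same error handling can be tried without a network request. The following is a minimal sketch, not a definitive implementation: the $sample markup is a made-up stand-in for a downloaded page, and it uses libxml_get_errors() to read every recorded error rather than only the last one. Which errors libxml reports for a bare "&" depends on the installed libxml2 version, so the error list may well be empty on some systems.

```php
<?php
// Hypothetical sample markup standing in for a downloaded page. The bare "&"
// in each href is the kind of input that can produce "htmlParseEntityRef"
// warnings on some libxml2 versions.
$sample = '<html><body><div id="ires"><ol>'
        . '<li><h3><a href="/url?q=http://example.com/&sa=U">Example</a></h3></li>'
        . '<li><h3><a href="/url?q=http://example.org/&sa=U">Other</a></h3></li>'
        . '</ol></div></body></html>';

// Record libxml errors internally instead of emitting PHP warnings,
// remembering the previous setting so it can be restored.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($sample);

// libxml_get_errors() returns all LibXMLError objects recorded so far.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Report each recorded error with its line number.
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Collect the href of every result link under the "ires" container.
$xpath = new DOMXPath($doc);
$hrefs = [];
foreach ($xpath->query('//div[@id="ires"]/ol/li/h3/a') as $node) {
    $hrefs[] = $node->getAttribute('href');
}
print_r($hrefs);
```

Restoring the previous libxml_use_internal_errors() setting keeps the change local to this parse, which matters when other code in the same process relies on the default warning behaviour.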