How do I get HTML data as an array or as XML?

Asked: 2013-09-26 05:35:11

Tags: php html xml web-scraping

I want to get my HTML data as an array or in XML format so that I can easily save it in a database. This is what I have so far:

$url = "http://www.example.com/";

$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);

$html = curl_exec($ch);
curl_close($ch); // release the handle whether or not the request succeeded

if ($html) {

    // parse the html into a DOMDocument
    $dom = new DOMDocument();

    $dom->recover = true;
    $dom->strictErrorChecking = false;

    @$dom->loadHTML($html);

    // collect every <div> element in the page
    $divs = $dom->getElementsByTagName('div');

} else {
    echo "The website could not be reached.";
}

What should I do to get the HTML as an array or in XML format? The HTML comes back like this:

<div>
 <ul>
   <li>Product Name</li>
   <li>Category</li>
   <li>Subcategory</li>
   <li>Product Price</li>
   <li>Product Company</li>
 </ul>
</div>

1 answer:

Answer 0 (score: 1)

For XML output, do the following:

function download_page($path){
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $path);
    curl_setopt($ch, CURLOPT_FAILONERROR, 1);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $retValue = curl_exec($ch);
    curl_close($ch);
    return $retValue;
}

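$sXML = download_page('http://example.com');
// SimpleXMLElement expects well-formed XML and throws an Exception otherwise
$oXML = new SimpleXMLElement($sXML);

// send the header once, before any output
header('Content-type: application/xml');

// this loop assumes the document has <entry> elements, each with a <title>
foreach($oXML->entry as $oEntry){
    echo $oEntry->title . "\n";
}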
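
Note that SimpleXMLElement only works if the page is well-formed XML, which ordinary HTML often is not. For markup like the sample in the question, a rough sketch that stays with DOMDocument may be closer to what is being asked. It reads the <li> values into a PHP array and then, if XML is wanted, writes them into a new XML document. The helper names html_list_items_to_array and array_to_xml, the XPath expression //div/ul/li, and the element names <items> and <item> are all assumptions made up for this example; the sample markup stands in for the cURL response.

```php
<?php
// Sample markup copied from the question; in practice this would be the cURL response body.
$html = <<<HTML
<div>
 <ul>
   <li>Product Name</li>
   <li>Category</li>
   <li>Subcategory</li>
   <li>Product Price</li>
   <li>Product Company</li>
 </ul>
</div>
HTML;

// Hypothetical helper: return the trimmed text of every <li> found under div > ul.
function html_list_items_to_array(string $html): array
{
    // Collect libxml parse warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $items = [];
    foreach ($xpath->query('//div/ul/li') as $li) {
        $items[] = trim($li->textContent);
    }
    return $items;
}

// Hypothetical helper: wrap the values in a small XML document.
// The element names <items> and <item> are invented for this sketch.
function array_to_xml(array $items): string
{
    $xml = new DOMDocument('1.0', 'UTF-8');
    $xml->formatOutput = true;
    $root = $xml->appendChild($xml->createElement('items'));
    foreach ($items as $value) {
        $item = $root->appendChild($xml->createElement('item'));
        // Appending a text node lets the DOM escape &, < and > in the value.
        $item->appendChild($xml->createTextNode($value));
    }
    return $xml->saveXML();
}

$items = html_list_items_to_array($html);
print_r($items);           // the values as a plain PHP array
echo array_to_xml($items); // the same values as an XML string
```

The array is probably the easier form to insert into a database, since each value can be bound to a prepared statement directly. The XML step is only needed if something downstream expects XML.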