Getting table data with cURL and regex

Date: 2014-09-02 07:28:41

Tags: php html xpath web-scraping domdocument

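This is my code for extracting data from a table.

But I want to remove the links.

I would also like to know how to group the titles and prices into an array.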

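<?php

$ch = curl_init("http://www.digionline.ir/Allprovince/CategoryProducts/cat=10301");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$page = curl_exec($ch);
curl_close($ch);

// Capture the inner HTML of the first <table> on the page.
// $matches[0] is the whole match, $matches[1] is the captured group.
preg_match('#<table[^>]*>(.+?)</table>#is', $page, $matches);
$match = isset($matches[1]) ? $matches[1] : '';

echo '<table>';
echo $match;
echo '</table>';

?>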

2 Answers:

Answer 0 (score: 3)

I suggest using an HTML parser instead. Use DOMDocument with DOMXPath; there is nothing to install, since both are built into PHP. For example:

$ch = curl_init ("http://www.digionline.ir/Allprovince/CategoryProducts/cat=10301");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$page = curl_exec($ch);

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXpath($dom);

$data = array();
// select every row of the table except the header rows (class "rowH")
$table_rows = $xpath->query('//table[@id="tbl-all-product-view"]/tr[@class!="rowH"]');
foreach($table_rows as $row => $tr) {
    foreach($tr->childNodes as $td) {
        $data[$row][] = preg_replace('~[\r\n]+~', '', trim($td->nodeValue));
    }
    $data[$row] = array_values(array_filter($data[$row]));
}

echo '<pre>';
print_r($data);

$data should contain:

Array
(
    [0] => Array
    (
        [0] => AMDA4-3400
        [1] => 1,200,000
        [2] => 1,200,000
    )

    [1] => Array
    (
        [0] => AMDSempron 145
        [1] => 860,000
        [2] => 910,000
    )
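
To group each title with its prices under named keys, the same approach can be adapted as in the sketch below. It is only an illustration: the inline `$page` markup is a hypothetical stand-in for the fetched page, and the key names `title` and `prices` are assumptions, since the thread does not say what the two numeric columns mean. Reading `textContent` returns only the text of a cell, so the `<a>` tags disappear without any extra step.

```php
<?php
// Hypothetical sample markup; the real page's structure may differ.
$page = <<<HTML
<table id="tbl-all-product-view">
  <tr class="rowH"><th>Title</th><th>Price A</th><th>Price B</th></tr>
  <tr class="row"><td><a href="/p/1">AMDA4-3400</a></td><td>1,200,000</td><td>1,200,000</td></tr>
  <tr class="row"><td><a href="/p/2">AMDSempron 145</a></td><td>860,000</td><td>910,000</td></tr>
</table>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings from imperfect HTML
$dom->loadHTML($page);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$products = array();
$rows = $xpath->query('//table[@id="tbl-all-product-view"]//tr[@class!="rowH"]');
foreach ($rows as $tr) {
    $cells = array();
    // Query only <td> elements so whitespace text nodes are skipped.
    foreach ($xpath->query('./td', $tr) as $td) {
        // textContent holds the text only, so link tags are dropped.
        $cells[] = trim(preg_replace('~\s+~', ' ', $td->textContent));
    }
    if (count($cells) < 2) {
        continue; // not a product row
    }
    // Assumed layout: first cell is the title, the rest are prices.
    $products[] = array(
        'title'  => $cells[0],
        'prices' => array_slice($cells, 1),
    );
}

print_r($products);
```

Each entry of `$products` can then be read as `$products[0]['title']` or `$products[0]['prices'][0]` rather than by numeric position alone.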

Answer 1 (score: 0)

If you need to parse a web resource, you can use PHP Simple HTML DOM Parser.

If you want to get a table and all the links inside it:

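include 'simple_html_dom.php'; // third-party library; adjust the path to your copy

$html = file_get_html('http://www.digionline.ir/Allprovince/CategoryProducts/cat=10301');
$table = $html->find('table', 0); // index 0 returns the first match instead of an array
$links = $table->find('a');       // array of <a> elements inside that table

echo $table;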
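
If the goal is to print the table with the links removed and a third-party library is not wanted, PHP's built-in DOM extension can probably do the job as well. The sketch below swaps DOMDocument in for Simple HTML DOM and replaces every `<a>` inside the first table with its visible text. The inline `$page` string is a hypothetical stand-in for the fetched page.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$page = '<table id="t"><tr><td><a href="/p/1">AMDA4-3400</a></td><td>1,200,000</td></tr></table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true); // suppress warnings from imperfect HTML
$dom->loadHTML($page);
libxml_clear_errors();

$table = $dom->getElementsByTagName('table')->item(0);
$output = '';
if ($table !== null) {
    // Copy the live node list first: replacing nodes while iterating it can skip items.
    $links = iterator_to_array($table->getElementsByTagName('a'));
    foreach ($links as $a) {
        // Swap each <a> for a plain text node holding its visible text.
        $a->parentNode->replaceChild($dom->createTextNode($a->textContent), $a);
    }
    $output = $dom->saveHTML($table); // serialize only the table element
}

echo $output;
```

Copying the node list into an array before the loop matters because `getElementsByTagName` returns a live list that shrinks as links are replaced.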