Extracting data from HTML using preg_match_all

Date: 2017-04-28 19:44:51

Tags: php preg-match-all

I have a series of HTML pages from which I want to extract certain product information. The HTML is structured as follows:

<h1 style="margin-top: 20px;">Productinformatie</h1>


<div class="group">
<div class="columns2">
            <table width="100%" cellpadding="4" cellspacing="0" border="0" class="product_info_table stripe">
    <tr style="background-color: #3c75a6; color: #fff; font-weight: bold;">
        <td colspan="2" style="background-color: #3c75a6; border-bottom: 2px solid #9dbeda;">Design</td>
    </tr>
                    <tr class="normal">
            <td width="250" valign="top"><b>Kleur van het product</b></td>
            <td><div style="max-height: 40px; overflow: hidden;">Zwart, Zilver</div></td>
        </tr>
.............
                    <tr class="normal">
            <td width="250" valign="top"><b>Hoogte (achterzijde)</b></td>
            <td><div style="max-height: 40px; overflow: hidden;">3 cm</div></td>
        </tr>
                </table>

</div>  
</div>

<div class="group" style="overflow-x: auto; overflow-y: hidden; height: 140px; white-space: nowrap;" id="image_scroll">

I use the line below but get no results; I need to understand how to handle line breaks (BR) in preg_match_all.

        //Omschrijving  <h1 style="margin-top: 20px;">Productinformatie</h1>    <div class="group"> <div class="columns2">  </table>    </div>      </div>
//  preg_match_all('/\<h1 style\=\"margin-top\: 20px\;\"\>Productinformatie\<\/h1\>(.*?)\<ul style\=\"list\-style\-type\: none\;\"\>/s', $html, $matchomschrijving);  
    preg_match_all('/\<h1 style\=\"margin-top\: 20px\;\"\>Productinformatie\<\/h1\>(.*)?\<\/table\>.*?\<\/div\>?\<\/div\>/s', $html, $matchomschrijving);  
//  $tempomschrijvinghtml = str_replace('"',"'",$matchomschrijving[1][0]); 
    $tempomschrijvinghtml = MinifyHTML($matchomschrijving[1][0]);
//  $tempomschrijving = '<table>';
    $tempomschrijving .= $tempomschrijvinghtml;
    $tempomschrijving .= '</table></div></div>';
    echo 'Omschrijving: ' . $tempomschrijving . '<br>'; 

Thanks.

1 answer:

Answer 0 (score: 0)

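To search, extract, and edit HTML, take advantage of the built-in DOMxxx classes and the structure of the HTML itself. With the XPath language you can efficiently target the parts of the DOM tree you need. For example: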

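// Collect libxml's warnings about real-world (non-strict) HTML internally instead of emitting them.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// The first table inside div.group > div.columns2 that follows the "Productinformatie" heading.
$nodeList = $xp->query('//h1[.="Productinformatie"]/following-sibling::div[@class="group"]/div[@class="columns2"]/table[1]');

// Print the table's HTML only if the query matched something.
if ($nodeList->length > 0) {
    echo $dom->saveHTML($nodeList->item(0));
}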
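
Since the goal in the question is to extract product information rather than the raw table markup, the same XPath approach can be taken one step further. The following is only a sketch under assumptions: the `$html` sample is a shortened, hypothetical copy of the markup from the question, `$productInfo` is a name introduced here for illustration, and every data row is assumed to be a `tr class="normal"` with the label in the first cell and the value in the second.

```php
<?php
// Hypothetical sample, shortened from the markup shown in the question.
$html = <<<'HTML'
<h1 style="margin-top: 20px;">Productinformatie</h1>
<div class="group">
<div class="columns2">
<table class="product_info_table stripe">
    <tr><td colspan="2">Design</td></tr>
    <tr class="normal">
        <td width="250" valign="top"><b>Kleur van het product</b></td>
        <td><div style="max-height: 40px; overflow: hidden;">Zwart, Zilver</div></td>
    </tr>
    <tr class="normal">
        <td width="250" valign="top"><b>Hoogte (achterzijde)</b></td>
        <td><div style="max-height: 40px; overflow: hidden;">3 cm</div></td>
    </tr>
</table>
</div>
</div>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Same table as in the answer, narrowed down to its data rows.
// "//tr" is used so the query works whether or not a tbody element is present.
$rows = $xp->query('//h1[.="Productinformatie"]/following-sibling::div[@class="group"]/div[@class="columns2"]/table[1]//tr[@class="normal"]');

$productInfo = []; // hypothetical name: label => value
foreach ($rows as $row) {
    // Query the direct td children relative to the current row.
    $cells = $xp->query('td', $row);
    if ($cells->length < 2) {
        continue; // skip rows that do not have a label cell and a value cell
    }
    $label = trim($cells->item(0)->textContent);
    $value = trim($cells->item(1)->textContent);
    $productInfo[$label] = $value;
}

print_r($productInfo);
```

Because `textContent` returns only the text of a node, the nested `<b>` and `<div>` wrappers, as well as the line breaks and indentation between tags, do not need to be matched explicitly, which avoids the line-break problem from the regular expression attempt.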