Regex: how to extract HTML heading tags

Date: 2015-01-25 04:11:29

Tags: php regex

I want to extract all heading tags (h1, h2, h3, ...) together with their contents. For example:

<h1 id="title">This is the title</h1>
<h2 id="subtitle">This is the subtitle</h2>
<p>And this is the paragraph</p>

should be extracted as:

<h1 id="title">This is the title</h1><h2 id="subtitle">This is the subtitle</h2>

I am using PHP and, as the title says, I would like to do this with a regular expression.

1 answer:

Answer 0 (score: 2)

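I would suggest using the right tool for the job: an HTML parser (DOMDocument with DOMXPath) rather than a regular expression.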

// loadHTML() is an instance method; in PHP 8 calling it statically throws an Error
$doc = new DOMDocument();
$doc->loadHTML('
    <h1 id="title">This is the title</h1>
    <h2 id="subtitle">This is the subtitle</h2>
    <p>And this is the paragraph</p>
    <p>another tag</p>
');

$xpath = new DOMXPath($doc);  
$heads = $xpath->query('//h1|//h2|//h3|//h4|//h5|//h6');

foreach ($heads as $tag) {
   echo $doc->saveHTML($tag), "\n";
}

Output:

<h1 id="title">This is the title</h1>
<h2 id="subtitle">This is the subtitle</h2>
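
Since the question asks for a regular expression specifically, here is a minimal sketch of that route for simple, well-formed input. The function name `extract_headings` is hypothetical and not part of the original answer. The pattern assumes headings are not nested and that no `>` appears inside attribute values. It can also match text inside comments or scripts, so the parser-based approach above remains the more robust choice.

```php
<?php
// Hypothetical helper for illustration: regex-based heading extraction.
// Assumes simple, well-formed HTML without nested headings.
function extract_headings(string $html): array
{
    // <(h[1-6])\b[^>]*>  opening h1..h6 tag with optional attributes
    // .*?                 contents, non-greedy; the "s" flag lets "." span newlines
    // </\1\s*>            closing tag of the same level, via backreference
    // The "i" flag makes tag names case-insensitive.
    preg_match_all('~<(h[1-6])\b[^>]*>.*?</\1\s*>~is', $html, $matches);

    // $matches[0] holds the full matched elements, in document order
    return $matches[0];
}

$html = '
    <h1 id="title">This is the title</h1>
    <h2 id="subtitle">This is the subtitle</h2>
    <p>And this is the paragraph</p>
';

// Join the matches without a separator, as in the question's expected result
echo implode('', extract_headings($html)), "\n";
```

For the sample input this prints `<h1 id="title">This is the title</h1><h2 id="subtitle">This is the subtitle</h2>`, which is the concatenated form the question asks for.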