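I'm struggling to extract content from a string (stored in a DB). Each div is a chapter, and the h2 content is its title. I want to extract the title and the content of each chapter (div) separately.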
<p>
<div>
<h2>Title 1</h2>
Chapter Content 1 with standard html tags (ex: the following tags)
<strong>aaaaaaaa</strong><br />
<em>aaaaaaaaa</em><br />
<u>aaaaaaaa</u><br />
<span style="color:#00ffff"></span><br />
</div>
<div>
<h2>Title 2</h2>
Chapter Content 2
</div>
...
</p>
I tried preg_match_all in PHP, but it doesn't work when I use standard HTML tags:
function splitDescription($pDescr)
{
    $regex = "#<div.*?><h2.*?>(.*?)</h2>(.*?)</div>#";
    preg_match_all($regex, $pDescr, $result);
    return $result;
}
Answer 0 (score: 1)
Before you try to parse HTML with a regular expression, I suggest you read this post.
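As a rough illustration of why the pattern in the question returns nothing, here is a minimal sketch. It assumes the stored string has line breaks between tags, as in the sample; the variable names are made up for the demonstration. Without the `s` modifier `.` does not match a newline, and the pattern leaves no room for whitespace between `<div>` and `<h2>`. Relaxing both gets a match on flat input, but a nested `<div>` still cuts the chapter short, which is the usual reason to prefer a parser.

```php
<?php
// Sample input shaped like the question's data: one tag per line
// (an assumption about how the string is stored).
$sample = "<div>\n<h2>Title 1</h2>\nChapter Content 1\n<strong>aaaaaaaa</strong><br />\n</div>";

// The pattern from the question, unchanged.
$original = '#<div.*?><h2.*?>(.*?)</h2>(.*?)</div>#';
$originalCount = preg_match_all($original, $sample, $m1);

// Variant that allows whitespace between the tags (\s*) and lets "." cross
// newlines (the "s" modifier).
$relaxed = '#<div.*?>\s*<h2.*?>(.*?)</h2>(.*?)</div>#s';
$relaxedCount = preg_match_all($relaxed, $sample, $m2);

// A nested <div> still defeats it: the lazy (.*?) stops at the first </div>,
// so everything after the inner block is lost.
$nested = "<div>\n<h2>T</h2>\nbefore<div>inner</div>after\n</div>";
preg_match_all($relaxed, $nested, $m3);

echo "original pattern matches: $originalCount\n";
echo "relaxed pattern matches: $relaxedCount\n";
echo "nested content captured: " . json_encode($m3[2][0]) . "\n";
```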
Answer 1 (score: 1)
Don't use a regular expression; it is not the right tool for this. Use an HTML parser, such as PHP's DOMDocument:
// Collect parse warnings internally instead of printing them
libxml_use_internal_errors( true);
$doc = new DOMDocument;
// $html holds the string loaded from the DB
$doc->loadHTML( $html);
$xpath = new DOMXPath( $doc);
// For each <div> chapter
foreach( $xpath->query( '//div') as $chapter) {
    // Get the <h2> and save its inner value into $title
    $title_node = $xpath->query( 'h2', $chapter)->item( 0);
    // Skip any <div> that has no <h2> title
    if( $title_node === null) {
        continue;
    }
    $title = $title_node->textContent;
    // Remove the <h2>
    $chapter->removeChild( $title_node);
    // Save the rest of the <div> children in $content
    $content = '';
    foreach( $chapter->childNodes as $child) {
        $content .= $doc->saveHTML( $child);
    }
    echo "$title - " . htmlentities( $content) . "\n";
}
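If you would rather get the chapters back as data than echo them, the same approach can be wrapped in a function. This is only a sketch: the name `splitChapters` and the `title`/`content` array shape are my own choices, not anything DOMDocument prescribes. It selects only `<div>` elements that have an `<h2>` child and leaves the parsed tree unmodified.

```php
<?php
// Hypothetical helper: the function name and the returned array shape are
// assumptions made for this example.
function splitChapters(string $html): array
{
    // Collect libxml warnings internally, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $chapters = [];

    // Only <div> elements with a direct <h2> child count as chapters.
    foreach ($xpath->query('//div[h2]') as $chapter) {
        $titleNode = $xpath->query('h2', $chapter)->item(0);

        // Serialize every child except the title node back to HTML.
        $content = '';
        foreach ($chapter->childNodes as $child) {
            if ($child->isSameNode($titleNode)) {
                continue;
            }
            $content .= $doc->saveHTML($child);
        }

        $chapters[] = [
            'title'   => trim($titleNode->textContent),
            'content' => trim($content),
        ];
    }

    return $chapters;
}

// Usage with a shortened version of the markup from the question.
$chapters = splitChapters(
    '<div><h2>Title 1</h2>Chapter <strong>one</strong></div>'
    . '<div><h2>Title 2</h2>Chapter Content 2</div>'
);
foreach ($chapters as $chapter) {
    echo $chapter['title'] . ' - ' . htmlentities($chapter['content']) . "\n";
}
```

Note that `saveHTML()` re-serializes the markup, so the content can differ cosmetically from the stored string (for example, `<br />` comes back as `<br>`).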