Get the HTML content of non-empty elements

Date: 2017-05-06 16:32:00

Tags: php html domdocument

I currently have this mess (don't ask how):

$string = "
<p>
    <b>Foo1:</b> Bar1<br>
    <b>Foo2:</b> Bar2<br>
    <b>Foo3:</b> Bar3<br>
    <b>Foo4:</b> Bar4
</p>
<br>
<p></p>
<br>
<p>
</br>
<br />
<br/>
<br>
</p>
";

So I need to trim away all of those empty <br> and <p> tags, like this:

$string = "
<p>
    <b>Foo1:</b> Bar1<br>
    <b>Foo2:</b> Bar2<br>
    <b>Foo3:</b> Bar3<br>
    <b>Foo4:</b> Bar4
</p>
";

I tried this:

$chars = " \t\n\r\0\x0B";
$subpattern = '(</?(br|p) ?/?[^>]*>)';
$pattern = '~(^'.$subpattern.'|'.$subpattern.'$)~i';

trim(preg_replace($pattern, '', $string), $chars);

But it only removes the last <p>. How can I make it work properly?

4 answers:

Answer 0: (score: 0)

Use the strip_tags function; see its description in the PHP documentation.
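A minimal sketch of what strip_tags does with input like the question's, assuming a shortened sample string and that <b> is the only tag worth keeping (both are choices made here, not part of the answer). Note that strip_tags drops every tag that is not allowed, whether or not the element is empty, so the wrapping <p> and the inner <br> disappear too.

```php
<?php
// Shortened sample input; the full string from the question behaves the same way.
$html = "<p>\n    <b>Foo1:</b> Bar1<br>\n</p>\n<br>\n<p></p>\n";

// The second argument lists the tags to keep; all other tags are removed.
$text = trim(strip_tags($html, '<b>'));

echo $text, PHP_EOL;
```

This only fits if losing the <p> wrapper is acceptable; it cannot tell an empty paragraph from one with content.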

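Answer 1: (score: 0)

Rather than trying a regex approach, try parsing the HTML and then discarding the empty elements, since that is really what you want to achieve. Something like DOMDocument::loadHTML (http://php.net/manual/en/domdocument.loadhtml.php) will give you a node tree that you can loop over and then convert back to HTML once you have removed the parts you don't need.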
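A rough sketch of that idea, assuming "empty" means an element that holds only whitespace, <br>, or other empty <p> elements. The helper names isBlankNode and stripBlankBlocks are made up for illustration, and only the top-level children of <body> are filtered.

```php
<?php
// Hypothetical helper: true for whitespace-only text, <br>, and <p> elements
// whose whole subtree is blank. Comments and similar nodes count as blank.
function isBlankNode(DOMNode $node): bool
{
    if ($node instanceof DOMText) {
        return trim($node->nodeValue) === '';
    }
    if (!($node instanceof DOMElement)) {
        return true;
    }
    if (!in_array($node->nodeName, ['p', 'br'], true)) {
        return false; // any other element counts as real content
    }
    foreach ($node->childNodes as $child) {
        if (!isBlankNode($child)) {
            return false;
        }
    }
    return true;
}

// Hypothetical helper: parse the fragment and re-serialize only the
// non-blank top-level nodes.
function stripBlankBlocks(string $html): string
{
    if (trim($html) === '') {
        return '';
    }
    $previous = libxml_use_internal_errors(true); // silence warnings on sloppy markup
    $dom = new DOMDocument();
    $dom->loadHTML($html); // assumes ASCII/Latin-1 input; other encodings need a charset hint
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        if (!isBlankNode($child)) {
            $out .= $dom->saveHTML($child);
        }
    }
    return $out;
}

$sample = "\n<p>\n    <b>Foo1:</b> Bar1<br>\n    <b>Foo4:</b> Bar4\n</p>\n<br>\n<p></p>\n<br>\n<p>\n</br>\n<br />\n<br/>\n<br>\n</p>\n";
echo stripBlankBlocks($sample), PHP_EOL;
```

Filtering while serializing avoids modifying the node list during iteration, which is easy to get wrong because childNodes is a live list.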

Answer 2: (score: 0)

An approach using DOMDocument and DOMXPath:

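// Returns true when the <p> passed in from XPath holds only whitespace,
// <br> elements, or other empty <p> elements.
// XPath hands node-sets to PHP callbacks as arrays, hence $n[0].
function isEmpty($n) {
    $nodeList = $n[0]->childNodes;
    foreach ($nodeList as $childNode) {
        switch ( $childNode->nodeType ) {
            case XML_ELEMENT_NODE:
                // Any element other than <p>/<br>, or a nested <p> with content, means "not empty".
                if ( !in_array($childNode->nodeName, ["p", "br"]) ||
                     ($childNode->nodeName == "p" && !isEmpty([$childNode])) ) return false;
                break;
            case XML_TEXT_NODE:
                if ( trim($childNode->nodeValue) !== "" ) return false;
                break;
        }
    }
    return true;
}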

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($string);
$xp = new DOMXPath($dom);
$xp->registerNamespace("php", "http://php.net/xpath");
$xp->registerPHPFunctions('isEmpty');

$nodeList = $xp->query('//br[not(./ancestor::p)] | //p[php:function("isEmpty", .)]');

foreach ($nodeList as $node) {
    $node->parentNode->removeChild($node);
}

foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $childNode) {
    echo $dom->saveHTML($childNode);
}

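For comparison, here is a sketch of the same cleanup done with XPath alone, without registering a PHP callback. It assumes that an "empty" <p> is one with no non-whitespace text and no descendant elements other than <br> or <p>; the variable names and the sample string are illustrative only.

```php
<?php
$string = <<<HTML
<p>
    <b>Foo1:</b> Bar1<br>
    <b>Foo2:</b> Bar2<br>
    <b>Foo3:</b> Bar3<br>
    <b>Foo4:</b> Bar4
</p>
<br>
<p></p>
<br>
<p>
</br>
<br />
<br/>
<br>
</p>
HTML;

$dom = new DOMDocument();
libxml_use_internal_errors(true); // the sample markup is deliberately sloppy
$dom->loadHTML($string);
libxml_clear_errors();
$xp = new DOMXPath($dom);

// <p> with no visible text and no descendant other than <br>/<p>.
$blankP = '//p[not(normalize-space()) and not(.//*[not(self::br or self::p)])]';
// <br> elements that sit outside any paragraph.
$strayBr = '//br[not(ancestor::p)]';

foreach ($xp->query($strayBr . ' | ' . $blankP) as $node) {
    if ($node->parentNode !== null) {
        $node->parentNode->removeChild($node);
    }
}

$result = '';
foreach ($dom->getElementsByTagName('body')->item(0)->childNodes as $childNode) {
    $result .= $dom->saveHTML($childNode);
}
echo trim($result), PHP_EOL;
```

The callback version is easier to extend with custom rules; the pure XPath version keeps everything in one query.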

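Answer 3: (score: 0)

Regex should not be used to parse HTML; use DOMDocument instead. Here we just query the DOMDocument for //p/b/.. (that is, every <p> that has a <b> child).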


<?php
ini_set('display_errors', 1);
libxml_use_internal_errors(true);

$string = <<<HTML
<p>
    <b>Foo1:</b> Bar1<br>
    <b>Foo2:</b> Bar2<br>
    <b>Foo3:</b> Bar3<br>
    <b>Foo4:</b> Bar4
</p>
</p>
<br>
<p></p>
<br>
<p>
</br>
<br/ >
<br/>
<br>
</p>
HTML;
$domObject= new DOMDocument();
$domObject->loadHTML($string, LIBXML_HTML_NODEFDTD);

$domXpath= new DOMXPath($domObject);
$results=$domXpath->query('//p/b/..');
foreach($results as $result)
{
    echo $domObject->saveHTML($result);
}

Output:

<p>
    <b>Foo1:</b> Bar1<br>
    <b>Foo2:</b> Bar2<br>
    <b>Foo3:</b> Bar3<br>
    <b>Foo4:</b> Bar4
</p>
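A small sketch of what the //p/b/.. query selects, using a made-up sample that also contains a paragraph with plain text. It suggests that //p/b/.. behaves like //p[b] and only matches paragraphs with a direct <b> child, whereas //p[normalize-space()] (an alternative assumed here, not part of the answer) matches any paragraph with visible text.

```php
<?php
// Illustrative input: one paragraph with <b>, one with plain text, two empty ones.
$html = '<p><b>Foo1:</b> Bar1</p><p>plain text</p><p></p><p><br></p>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$viaParentStep = $xpath->query('//p/b/..');              // parents of <b> children of <p>
$viaPredicate  = $xpath->query('//p[b]');                // <p> that has a <b> child
$withText      = $xpath->query('//p[normalize-space()]'); // <p> with non-whitespace text

echo $viaParentStep->length, ' ', $viaPredicate->length, ' ', $withText->length, PHP_EOL;
```

If the real content may contain paragraphs without a <b> element, the text-based predicate is the safer of the two.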