Question

$output = htmlentities("example<br><br>example");
echo $output;

$output = preg_replace( 
  array( '#[\s\n\\n]*<[\/\s]*(br|hr|/p|/div)[\/\s]*>[\s\n\\n]*#iu', '#\s+#' ), 
  ' ', 
  $output );
echo $output;

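The code above outputs example&lt;br&gt;&lt;br&gt;example instead of example example. Both echo statements print the same string, example&lt;br&gt;&lt;br&gt;example. However, I need to keep using htmlentities(), because without it preg_replace destroys some special characters such as à. I described that problem in another question: PHP regex breaking special characters

Does anyone know a solution? Thanks.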

Answer 1

htmlentities replaces < and > with &lt; and &gt;, so your regex needs to search for those entities instead of the literal characters.

$output = preg_replace( 
  array( '#\s*&lt;[\/\s]*(br|hr|/p|/div)[\/\s]*&gt;\s*#iu', '#\s+#' ), 
  ' ', 
  $output );
echo $output;
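
To see why this works, here is a minimal self-contained sketch that follows the string through both steps. The variable names ($raw, $encoded, $result) are illustrative and not taken from the question; the pattern is the one from this answer.

```php
<?php
// Raw input containing two <br> tags between the words.
$raw = "example<br><br>example";

// htmlentities() turns < into &lt; and > into &gt;,
// so no literal angle brackets remain in $encoded.
$encoded = htmlentities($raw);

// First pattern: match the *encoded* form of the tags.
// Second pattern: collapse any run of whitespace into one space.
$patterns = array(
    '#\s*&lt;[/\s]*(br|hr|/p|/div)[/\s]*&gt;\s*#iu',
    '#\s+#',
);
$result = preg_replace($patterns, ' ', $encoded);

echo $encoded, "\n"; // example&lt;br&gt;&lt;br&gt;example
echo $result, "\n";  // example example
```

Each encoded tag is first replaced by a space, and the second pattern then collapses the resulting double space into one.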

Answer 2

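If I understand correctly, you need a strip_tags variant that leaves a space between adjacent text nodes, so that words do not get glued together.

One way to do this is with the DOMDocument class. You may also want to remove non-printable content, such as the contents of script tags: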

function DOMRemoveTags($dom, $tags) {
    foreach($tags as $tag) {
        foreach(iterator_to_array($dom->getElementsByTagName($tag)) as $node) {
            $node->parentNode->removeChild($node);
        }
    }
}

function getHtmlText($html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    // Remove some tags together with their content
    DOMRemoveTags($dom, ['script','textarea','iframe']); // extend as needed
    $xpath = new DOMXPath($dom);
    // Get all text nodes and join them with a space delimiter
    return implode(' ', array_map(function($node) {
        return trim($node->nodeValue);
    }, iterator_to_array($xpath->query('//text()'))));
}

$html = "example<br><br><script>fdsfsd</script><script>222</script>example";
echo htmlentities(getHtmlText($html));

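By using the DOM API, you avoid some potential problems of the regex solution: if the HTML string contains < characters that do not start a tag (in text, attribute values, comments, scripts, ...), a regular expression may produce unwanted results.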
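
As a small illustration of that pitfall, the sketch below applies a naive tag-stripping pattern to text that contains a stray < character. The pattern '#<[^>]*>#' is a hypothetical example chosen for this demonstration; it is not the regex from the question or from Answer 1.

```php
<?php
// The first < is part of the text ("1 < 2"), not the start of a tag.
$html = '1 < 2 <b>bold</b>';

// Naive approach (assumed for illustration): treat everything from a <
// up to the next > as a tag, then collapse whitespace.
$stripped = trim(preg_replace(array('#<[^>]*>#', '#\s+#'), ' ', $html));

// The match starts at the stray < and runs to the > of <b>,
// so the "2" is removed together with the tag.
echo $stripped, "\n"; // 1 bold
```

The visible text "2" is lost because the pattern cannot tell a literal < from the start of a tag.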

PHP: preg_replace does not work on the result of htmlentities()
