Clean up content inside HTML tags

Time: 2013-06-10 15:15:20

Tags: php regex preg-replace

I am trying to write a preg_replace that strips all attributes from the allowed tags and removes every tag that is not in the allowed list.

Basic example - this:

<p style="some styling here">Test<div class="button">Button Text</div></p> 

becomes:

<p>Test</p>

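I have that working well, except for img tags and a (href) tags: their attributes must not be stripped. The same may apply to other tags. I'm not sure whether there is a way to set up two allow lists?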

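1) One list of tags that are allowed and remain after being cleaned (attributes stripped)
2) One list of tags that are allowed and left alone?
3) Everything else is removed.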

Here is the script I'm working on:

$string = '<p style="width: 250px;">This is some text<div class="button">This is the button</div><br><img src="waves.jpg" width="150" height="200" /></p><p><b>Title</b><br>Here is some more text and <a href="#" target="_blank">this is a link</a></p>';

$output = strip_tags($string, '<p><b><br><img><a>');
$output = preg_replace("/<([a-z][a-z0-9]*)[^>]*?(\/?)>/i", '<$1$2$3$4$5>', $output);

echo $output;

This script should clean $string down to:

<p>This is some text<br><img src="waves.jpg" width="150" height="200" /></p><p><b>Title</b><br>Here is some more text and <a href="#" target="_blank">this is a link</a></p>

1 answer:

Answer 0 (score: 1)

http://ideone.com/aoOOUN

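This function removes the child elements of an element that are not allowed, strips the attributes from its "stripped" children, and leaves the remaining elements untouched (recursively).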

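function clean($element, $allowed, $stripped){
    // $allowed:  tag names kept as-is, attributes included
    // $stripped: tag names kept, but with every attribute removed
    if(!is_array($allowed) || !is_array($stripped)) return;
    if(!$element) return;
    $toDelete = array();
    foreach($element->childNodes as $child){
        // Text nodes and comments have no tagName; leave them alone.
        if(!isset($child->tagName)) continue;
        $n = $child->tagName;
        // Tags in neither list are queued for removal, contents included.
        if ($n && !in_array($n, $allowed) && !in_array($n, $stripped)){
            $toDelete[] = $child;
            continue;
        }
        if($n && in_array($n, $stripped)){
            // Collect the names first, then remove, so the attribute map
            // is not modified while it is being iterated.
            $attr = array();
            foreach($child->attributes as $a)
                $attr[] = $a->nodeName;
            foreach($attr as $a)
                $child->removeAttribute($a);
        }
        clean($child, $allowed, $stripped);
    }
    // Remove after the loop so childNodes is not modified mid-iteration.
    foreach ($toDelete as $del)
        $element->removeChild($del);
}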

Here is the code that cleans the string:

$xhtml = '<p style="width: 250px;">This is some text<div class="button">This is the button</div><br><img src="waves.jpg" width="150" height="200" /></p><p><b>Title</b><br>Here is some more text and <a href="#" target="_blank">this is a link</a></p>';

$dom = new DOMDocument();
$dom->loadHTML($xhtml);
$body = $dom->getElementsByTagName('body')->item(0);
clean($body, array('img', 'a'), array('p', 'br', 'b'));
echo preg_replace('#^.*?<body>(.*?)</body>.*$#s', '$1', $dom->saveHTML($body));
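The preg_replace in the last line only strips the <body> wrapper that saveHTML($body) emits. A possible alternative, sketched here assuming PHP 7 or later, is to serialize each child of the body node separately. The helper name innerHtml is hypothetical and not part of the original answer; the libxml_use_internal_errors call is optional and simply keeps parser warnings about malformed input from being printed.

```php
<?php
// Hypothetical helper (not from the original answer): returns the
// serialized children of a node, i.e. its "inner HTML".
function innerHtml(DOMNode $node): string
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes only that subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML('<p><b>Title</b><br>Some text</p>');
libxml_clear_errors();

$body = $dom->getElementsByTagName('body')->item(0);
echo innerHtml($body), "\n";
```

This avoids running a regex over serialized markup, at the cost of one saveHTML call per top-level child.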

You should take a look at the documentation for PHP's DOM classes.
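For comparison, the "two lists" idea from the question can also be sketched with a regex, staying closer to the asker's original attempt. This is only a rough illustration, assuming PHP 7 or later: cleanTags, $keep and $strip are hypothetical names, and the pattern breaks on attribute values that contain ">". Unlike the DOM approach, strip_tags removes only the disallowed tags themselves, so the text inside a removed element such as the div stays in the output.

```php
<?php
// Hypothetical helper illustrating the two-list idea with a regex.
// $keep:  tags left untouched (attributes preserved)
// $strip: tags kept, but with all attributes removed
// Tags in neither list are removed; their text content remains.
function cleanTags(string $html, array $keep, array $strip): string
{
    // Drop every tag that appears in neither list.
    $allowed = '<' . implode('><', array_merge($keep, $strip)) . '>';
    $html = strip_tags($html, $allowed);

    // Rewrite each remaining opening tag; closing tags never match
    // because the pattern requires a letter right after "<".
    return preg_replace_callback(
        '/<([a-z][a-z0-9]*)\b[^>]*?(\/?)>/i',
        function (array $m) use ($keep) {
            if (in_array(strtolower($m[1]), $keep, true)) {
                return $m[0];                  // leave the tag as written
            }
            return '<' . $m[1] . $m[2] . '>';  // drop the attributes
        },
        $html
    );
}

$input = '<p style="x">Hi <a href="#">link</a><span>s</span></p>';
echo cleanTags($input, ['a', 'img'], ['p', 'br', 'b']), "\n";
```

The DOM-based function remains the more robust choice for arbitrary markup; the regex version is mainly useful when the input is known to be simple.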