How to remove duplicate HTML tags from a string - PHP / regex

Time: 2019-06-25 20:36:34

Tags: php regex

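I have some HTML files created by a buggy online HTML editor. The user selects some text and presses the italic button, and the text is then wrapped in <em></em> tags.

With this feature, a user sometimes italicizes some text, removes the italics, and then italicizes it again.

In many cases I end up with bad HTML that contains duplicated tags, like this:

Example 1: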

Adding insult to injury, <em><em>Jennifer <a href="somelink">Aniston</a></em> had literally <a href="somelink2">zero clue</a> what was coming.</em>

Example 2:

Adding insult to injury, <em><em>Jennifer Aniston</em> had literally <a href="somelink2">zero clue</a> what was coming.</em>

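The question is how to remove the duplicated tags: an <em> tag inside another <em> tag is unnecessary and should be removed.

I wrote some code, but it does not work well. A regular expression would be a nice solution; I tried a few, but none of them worked, so I took a different approach: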

function repairDoubleTags($line = '', $rtag = 'em') {
    if(empty($line)) return false;

    if(!preg_match("#<".$rtag.">#", $line)) 
        return $line;

    $tmp = explode(" ", $line);
    //print_r($tmp);

    $lastposition = -1;
    $remove_next = 0;

    foreach($tmp as $nr => $word) {     
        //echo $word."\r\n";

        if(empty($word)) {
            unset($tmp[$nr]);
            continue;
        }

        if(preg_match("#<".$rtag.">#", $word)) {
            if($lastposition == -1) {
                $lastposition = $nr;
                //echo "----------------- ".$rtag." FOUND\r\n";
            }else {
                $tmp[$nr] = trim(preg_replace("#<".$rtag.">#", "", $tmp[$nr]));
                $remove_next = 1;
                $lastposition = -1;
                //echo "----------------- DOUBLE ".$rtag." FOUND AND REMOVED\r\n";
            }
        }

        if(preg_match("#</".$rtag.">#", $word)) {
            if($remove_next == 1) {
                $tmp[$nr] = trim(preg_replace("#</".$rtag.">#", "", $tmp[$nr]));
                $remove_next = 0;
                //echo "----------------- DOUBLE END ".$rtag." FOUND AND REMOVED\r\n";
            }else {
                $lastposition = -1;
            }
        }

        if(empty($tmp[$nr]))
            unset($tmp[$nr]);

    }

    //print_r($tmp);
    $line = join(' ', $tmp);
    //print_r($line);
    //exit;

    return $line;
}

However, this code does not work if the HTML contains more than one <em>. For example, it fails in the following case:

Adding insult to injury, <em><em>Jennifer Aniston</em> had literally <a href="somelink2">zero clue</a> what <em>was coming</em>.</em>

Are there any regex experts with a quick, clean solution?

Thanks!

1 Answer:

Answer 0: (score: -1)

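It is a bit complicated to guess what other invalid <em> cases we might run into here, but if you wish to explore regex options, we could start with an expression similar to: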

(?=<em><em>)(<em>)(.*?)(<\/em>)

and replace with $2. This is only an example, and the expression will certainly fail on some inputs.
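One input where the expression falls short is the multi-<em> example from the question. The snippet below is only a quick check of that case, using the same expression and replacement as above; nothing in it is new apart from splitting the input string across two lines for readability.

```php
<?php
// The expression from this answer, applied to the question's multi-<em> example.
$re = '/(?=<em><em>)(<em>)(.*?)(<\/em>)/';
$str = 'Adding insult to injury, <em><em>Jennifer Aniston</em> had literally '
     . '<a href="somelink2">zero clue</a> what <em>was coming</em>.</em>';

$result = preg_replace($re, '$2', $str);

// Only the adjacent <em><em> pair is collapsed; the later <em> stays nested
// inside the outer one, because the lookahead requires two opening tags in a row.
echo $result, "\n";
// Adding insult to injury, <em>Jennifer Aniston had literally <a href="somelink2">zero clue</a> what <em>was coming</em>.</em>
```

The result is balanced, but "was coming" is still wrapped in an inner <em>, which is exactly the kind of nesting the question wants removed.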

  

If we might also have invalid tags other than em, we would simply loop over the expression for each tag and replace.
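A minimal sketch of that loop, under a few assumptions: collapseDoubledTags is a hypothetical helper name, the tag list passed in is just an illustration, and the tags carry no attributes. It also uses the s modifier so that . can match across line breaks, which is a choice made here and not part of the expression above.

```php
<?php
// Hypothetical helper (not from the original answer): applies the same
// lookahead expression once per tag name.
function collapseDoubledTags(string $html, array $tags): string
{
    foreach ($tags as $tag) {
        $t = preg_quote($tag, '/');
        // Same shape as the expression above, with the tag name substituted in.
        $re = '/(?=<' . $t . '><' . $t . '>)(<' . $t . '>)(.*?)(<\/' . $t . '>)/s';
        // preg_replace returns null on error; fall back to the unchanged input.
        $html = preg_replace($re, '$2', $html) ?? $html;
    }
    return $html;
}

$input = '<em><em>Jennifer</em> had <strong><strong>zero</strong> clue</strong>.</em>';
echo collapseDoubledTags($input, ['em', 'strong']), "\n";
// <em>Jennifer had <strong>zero clue</strong>.</em>
```

Like the single-tag expression, this only handles two opening tags that sit directly next to each other.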

Test

$re = '/(?=<em><em>)(<em>)(.*?)(<\/em>)/m';
$str = 'Adding insult to injury, <em><em>Jennifer <a href="somelink">Aniston</a></em> had literally <a href="somelink2">zero clue</a> what was coming.</em>

Adding insult to injury, <em><em>Jennifer Aniston</em> had literally <a href="somelink2">zero clue</a> what was coming.</em>
Adding insult to injury, <em><em>Jennifer Aniston</em> had literally <a href="somelink2">zero clue</a> what was coming.</em>

';
$subst = '$2';

$result = preg_replace($re, $subst, $str);

echo $result;

Please see the demo for additional explanation.

Output

Adding insult to injury, <em>Jennifer <a href="somelink">Aniston</a> had literally <a href="somelink2">zero clue</a> what was coming.</em>

Adding insult to injury, <em>Jennifer Aniston had literally <a href="somelink2">zero clue</a> what was coming.</em>
Adding insult to injury, <em>Jennifer Aniston had literally <a href="somelink2">zero clue</a> what was coming.</em>
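For inputs like the question's last example, where another <em> opens later inside the outer one, counting nesting depth may be more dependable than the lookahead expression. The following is only a sketch of that idea, not the method used in this answer: stripNestedTag is a hypothetical helper name, and it assumes the tags carry no attributes (plain <em> and </em>), as in the question's examples. Dropping a stray closing tag is also a choice made here.

```php
<?php
// Hypothetical helper: keeps only the outermost <tag>...</tag> pair and
// drops every opening/closing tag of the same name nested inside it.
function stripNestedTag(string $html, string $tag = 'em'): string
{
    $depth = 0;
    $pattern = '#<(/?)' . preg_quote($tag, '#') . '>#i';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use (&$depth): string {
            if ($m[1] === '') {
                // Opening tag: keep it only when it starts the outermost level.
                $depth++;
                return $depth === 1 ? $m[0] : '';
            }
            if ($depth === 0) {
                // Closing tag with no matching opener: dropped (assumption).
                return '';
            }
            // Closing tag: keep it only when it closes the outermost level.
            $depth--;
            return $depth === 0 ? $m[0] : '';
        },
        $html
    );

    // preg_replace_callback returns null on error; fall back to the input.
    return $result ?? $html;
}

$str = 'Adding insult to injury, <em><em>Jennifer Aniston</em> had literally '
     . '<a href="somelink2">zero clue</a> what <em>was coming</em>.</em>';
echo stripNestedTag($str), "\n";
// Adding insult to injury, <em>Jennifer Aniston had literally <a href="somelink2">zero clue</a> what was coming.</em>
```

Because the counter tracks every opening and closing tag, it does not depend on the duplicated tags being adjacent. It would still need extending for tags with attributes or for markup inside comments and scripts.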