Question

In my PHP script, a variable holds the following HTML.

<div>
    first line starting text  <span class='highlight blink'> first line middlte text1 </span> first line end text.
    second line starting text  <span class="target"> second line middlte text2  </span> second line end text
    <div class="highlight blink"> third line text</div>
</div>

I want to remove the tags that have the highlight class, so that the HTML above looks like this (using regex only):

<div>
   first line starting text  first line middlte text1 first line end text.
   second line starting text  <span class="target"> second line middlte text2  </span> second line end text
   third line text
</div>

I tried this, but it does not replace the div tag, which has multiple classes (see the third line; the div tag must be removed):

$data = preg_replace('#<(\w+) class=["\']highlight["\']>(.*)<\/\1>#', '\2', $data);

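I also tried this, but it replaces every tag that has a class (see the second line; the span tag with the target class should stay unchanged):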

$data = preg_replace('#<(\w+) class=["\'](\w+)["\']>(.*)<\/\1>#', '\2', $data);

Can anyone help? Thanks in advance. I have been trying for 2 days.

Answer 1

How about not using a regular expression?

<?php

// your HTML string
$string = <<<HTML
<div>
    first line starting text  <span class='highlight blink'> first line middlte text1 </span> first line end text.
    second line starting text  <span class="target"> second line middlte text2  </span> second line end text
    <div class="highlight blink"> third line text</div>
</div>
HTML;

// classname
$classname = 'highlight';

$doc = new DOMDocument();

// load HTML and remove doctype, html, body tags in PHP >= 5.4.0
$doc->loadHTML($string, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// load HTML and remove doctype, html, body tags in PHP < 5.4.0
/*
$doc->loadHTML($string);
$doc->removeChild($doc->doctype);
$doc->replaceChild($doc->firstChild->firstChild->firstChild, $doc->firstChild);
*/

$finder = new DOMXPath($doc);

/** @var DOMNodeList $nodes */
$nodes = $finder->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]");

/** @var DOMElement $node */
foreach ($nodes as $node) {

    /** @var DOMElement $parent */
    $parent = $node->parentNode;

    /** @var DOMText $child */
    $child = $doc->createTextNode(trim($node->nodeValue));

    $parent->insertBefore($child, $node);
    $parent->removeChild($node);
}

var_dump($doc->saveHTML());
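
One thing to be aware of: the loop above replaces each matched element with a text node built from `nodeValue`, so any markup nested inside it (a `<b>` inside the highlighted span, say) is flattened to plain text. If nested tags should survive, a possible variant is to move the children out of the element before removing it. This is only a sketch: `unwrapByClass` is a made-up helper name, not part of the answer, and it assumes the fragment has a single root element, as in the example.

```php
<?php
/**
 * Hypothetical helper (not from the answer): removes every element that
 * carries $classname but keeps its children, including nested tags.
 */
function unwrapByClass(string $html, string $classname): string
{
    $doc = new DOMDocument();
    // Same flags as in the answer; they assume a single root element.
    $doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    $xpath = new DOMXPath($doc);
    // Same class-token test as in the answer: matches "highlight" as a
    // whole word in the class list, not as a substring of another class.
    $nodes = $xpath->query(
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' $classname ')]"
    );

    foreach ($nodes as $node) {
        // Move each child in front of the element, then drop the empty element.
        while ($node->firstChild !== null) {
            $node->parentNode->insertBefore($node->firstChild, $node);
        }
        $node->parentNode->removeChild($node);
    }

    return $doc->saveHTML();
}

$html = "<div>a <span class='highlight blink'> b <b>c</b> </span> d</div>";
echo unwrapByClass($html, 'highlight');
```

Unlike the original loop, this does not trim whitespace around the unwrapped content, so the surrounding spaces stay exactly as they were in the source.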

Answer 2

It can be done with a regular expression (but see one of my previous answers on when that is safe).

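Nonetheless, this particular case is quite hard, because you have to account for every possible configuration of the tags and may end up matching things you did not intend. I strongly recommend using an HTML parser, like the one suggested here.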

Anyway, a possible solution that tries to be as generic and as safe as possible is:

$data = "<div>
first line starting text  <span class='highlight blink'> first line <b>middlte</b> text1 </span> first line end text.
second line starting text  <span class='target'> second line middlte text2  </span> second line end text
    <div class='highlight blink'> third line text</div>
</div>";

$data = preg_replace(
  '/<(\w+).*[^>]+class=["\'][^"\']*highlight[^"\']*["\'][^>]*>(.*?)<\/\1>/',
  '$2',
  $data );

echo( $data );

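It works for every tag whose class attribute contains the text highlight, and also when the matched tag has other tags nested inside it, for example <div class='highlight'>Something <b>else</b></div>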
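
Two caveats with the pattern as written, as far as I can tell: it matches highlight as a substring, so a class such as highlighted would also be stripped, and the greedy `.*` after the tag name can run past the end of one tag into a later tag on the same line. Below is a minimal sketch of a tighter variant. `stripTagsByClass` is a hypothetical name, and the pattern is my own adjustment rather than the answer's. It is still a regex, so an element that contains a nested element with the same tag name will not be handled correctly.

```php
<?php
/**
 * Hypothetical helper: unwraps tags whose class list contains $class
 * as a whole, whitespace-separated word.
 */
function stripTagsByClass(string $html, string $class): string
{
    $c = preg_quote($class, '~');
    // [^>]* keeps the match inside a single opening tag;
    // (?:[^"']*\s)? and (?:\s[^"']*)? allow other classes around the word;
    // the s flag lets the content span several lines.
    $pattern = '~<(\w+)\b[^>]*\sclass=["\'](?:[^"\']*\s)?' . $c
             . '(?:\s[^"\']*)?["\'][^>]*>(.*?)</\1>~s';

    // preg_replace returns null on error; fall back to the input in that case.
    return preg_replace($pattern, '$2', $html) ?? $html;
}

$line = '<span class="target"> keep </span> <span class="highlight"> drop tag </span>';
echo stripTagsByClass($line, 'highlight');
```

With both spans on one line, the target span is left alone and only the highlight span is unwrapped.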

Example here

Update: Working PHP example

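The regex:

searches for an opening tag
looks for class="..." between the start and the end of that tag
requires that everything in the tag other than class="..." does not contain >
inside class="..." looks for the word highlight, which may be surrounded by other words
after all of that, matches any sequence of characters until it finds
the closing tag that matches the opening one, using a backreference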
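
The breakdown above can also be written directly into the pattern. The sketch below is the same regex, only spread over several lines with PCRE's x (extended) flag so that each part carries a comment. The `~` delimiter and the nowdoc are my own choices, made to avoid escaping quotes and slashes; `$annotated` is a made-up variable name.

```php
<?php
// The pattern from the answer, unchanged, written with the x flag.
// A nowdoc is used so that quotes and backslashes need no PHP escaping.
$annotated = <<<'REGEX'
~
<(\w+)                   # opening tag; group 1 captures the tag name
.*[^>]+                  # whatever sits between the tag name and the class attribute
class=["']               # the class attribute and its opening quote
[^"']*highlight[^"']*    # the word highlight, possibly surrounded by other text
["']                     # closing quote of the class attribute
[^>]*>                   # any remaining attributes, then the end of the opening tag
(.*?)                    # group 2: the content, matched lazily
</\1>                    # closing tag with the same name as group 1
~x
REGEX;

$data = "<div>
first line starting text  <span class='highlight blink'> first line <b>middlte</b> text1 </span> first line end text.
second line starting text  <span class='target'> second line middlte text2  </span> second line end text
    <div class='highlight blink'> third line text</div>
</div>";

echo preg_replace($annotated, '$2', $data);
```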

Remove HTML tags with a specific class, but not their content, using regex
