Question

I am trying to write a regular expression that strips every attribute from a tag except the src attribute. For example:

<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>

would be returned as:

<p>This is a paragraph with an image <img src="/path/to/image.jpg" /></p>

I have a regular expression that strips all attributes, but I am trying to adapt it so that it leaves src in place. This is what I have so far:

<?php preg_replace('/<([A-Z][A-Z0-9]*)(\b[^>]*)>/i', '<$1>', '<html><goes><here>');

I am using PHP's preg_replace() for this.

Thanks! Ian

Answer 1

This might work for your needs:

$text = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

echo preg_replace("/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i",'<$1$2$3>', $text);

// <p>This is a paragraph with an image <img src="/path/to/image.jpg"/></p>

RegExp breakdown:

/              # Start Pattern
 <             # Match '<' at beginning of tags
 (             # Start Capture Group $1 - Tag Name
  [a-z]         # Match 'a' through 'z'
  [a-z0-9]*     # Match 'a' through 'z' or '0' through '9' zero or more times
 )             # End Capture Group
 (?:           # Start Non-Capture Group
  [^>]*         # Match anything other than '>', Zero or More Times
  (             # Start Capture Group $2 - ' src="...."'
   \s            # Match one whitespace
   src=          # Match 'src='
   ['"]          # Match ' or "
   [^'"]*        # Match anything other than ' or " 
   ['"]          # Match ' or "
  )             # End Capture Group 2
 )?            # End Non-Capture Group, match group zero or one time
 [^>]*?        # Match anything other than '>', Zero or More times, not-greedy (won't eat the /)
 (\/?)         # Capture Group $3 - '/' if it is there
 >             # Match '>'
/i            # End Pattern - Case Insensitive

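Add some quoting, use the replacement text <$1$2$3>, and it should strip any non-src= attributes from well-formed HTML tags.

Please note that this will not necessarily work on all input, as the anti-HTML-plus-RegExp crowd so cleverly notes below. There are a few pitfalls, most notably that <p style=">"> would end up as <p>">, along with a few other broken cases. I would recommend looking at Zend_Filter_StripTags as a more foolproof tag/attribute filter in PHP.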
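
The pitfall mentioned above is easy to reproduce. A minimal sketch, reusing the pattern from this answer unchanged; the input string and the variable names are only illustrative:

```php
<?php
// Pattern copied verbatim from the answer above.
$pattern = "/<([a-z][a-z0-9]*)(?:[^>]*(\ssrc=['\"][^'\"]*['\"]))?[^>]*?(\/?)>/i";

// An attribute value that contains '>' ends the tag match too early.
$input = '<p style=">">text</p>';

$output = preg_replace($pattern, '<$1$2$3>', $input);
echo $output, "\n"; // prints: <p>">text</p>
```

The regex has no notion of quoted attribute values, so the first '>' it sees is treated as the end of the tag and the rest of the attribute leaks into the text.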

Answer 2

You usually should not parse HTML using regular expressions

Instead, you should call DOMDocument::loadHTML. You can then recurse through the elements in the document and call removeAttribute.
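
A minimal sketch of that approach, using the sample markup from the question. Two things here are assumptions rather than part of this answer: the recursion is swapped for getElementsByTagName('*'), which visits every element, and the LIBXML flags are added so that loadHTML() does not wrap the fragment in html/body tags or a doctype.

```php
<?php
$html = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

foreach ($dom->getElementsByTagName('*') as $element) {
    // Collect the names first: the attribute map is live, so removing
    // entries while iterating over it would skip some of them.
    $names = [];
    foreach ($element->attributes as $attribute) {
        $names[] = $attribute->nodeName;
    }
    foreach ($names as $name) {
        if (strtolower($name) !== 'src') {
            $element->removeAttribute($name);
        }
    }
}

$result = trim($dom->saveHTML());
echo $result, "\n";
```

Because the parser understands quoting, an attribute value such as style=">" is removed cleanly instead of confusing the match.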

Answer 3

Unfortunately, I am not sure how to answer this question for PHP. If I were using Perl, I would do the following:

use strict;
my $data = q^<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>^;

$data =~ s{
    <([^/> ]+)([^>]+)> # split into tagtype, attribs
}{
    my $attribs = $2;
    my @parts = split( /\s+/, $attribs ); # separate by whitespace
    @parts = grep { m/^src=/i } @parts;   # retain just src attributes
    if ( @parts ) {
        "<" . join( " ", $1, @parts ) . ">";
    } else {
        "<" . $1 . ">";
    }
}xseg;

print( $data );

This returns:

<p>This is a paragraph with an image <img src="/path/to/image.jpg"></p>
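
For PHP, the same split-and-filter idea could be sketched with preg_replace_callback(). This is a rough port, not a tested drop-in: the pattern is changed slightly to require whitespace after the tag name (an assumption, so that an attribute-less tag is left untouched), and it still splits on whitespace, so a src value containing a space would be broken apart.

```php
<?php
$data = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

$result = preg_replace_callback(
    '~<([^/>\s]+)(\s[^>]*)>~',   // $m[1] = tag name, $m[2] = raw attribute text
    function (array $m): string {
        // Separate the attribute text on whitespace, dropping empty pieces.
        $parts = preg_split('/\s+/', $m[2], -1, PREG_SPLIT_NO_EMPTY);
        // Retain only the pieces that start with src=.
        $parts = array_values(preg_grep('/^src=/i', $parts));
        return '<' . implode(' ', array_merge([$m[1]], $parts)) . '>';
    },
    $data
);

echo $result, "\n";
// prints: <p>This is a paragraph with an image <img src="/path/to/image.jpg"></p>
```

As in the Perl version, the self-closing slash on the img tag is dropped along with the other attributes.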

Answer 4

OK, here is what I have used, and it seems to work well:

<([A-Z][A-Z0-9]*)(\b[^>src]*)(src\=[\'|"|\s]?[^\'][^"][^\s]*[\'|"|\s]?)?(\b[^>]*)>

Feel free to poke holes in it.

Answer 5

As mentioned above, you should not use regular expressions to parse HTML or XML.

I would do your example with str_replace(), if the input is always the same.

$str = '<p id="paragraph" class="green">This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/></p>';

// The search string includes the leading space so that "<p>" is left without a stray blank.
$str = str_replace(' id="paragraph" class="green"', "", $str);

$str = str_replace('width="50" height="75"',"",$str);

Answer 6

Posting this to provide a solution for Oracle regex:

<([^!][a-z][a-z0-9]*)([^>]*(\ssrc=[''''\"][^''''\"]*[''''\"]))?[^>]*?(\/?)>

Answer 7

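Do not use regex to parse valid HTML. Use regex to parse an HTML document only if all available DOM parsers are failing you. I am very fond of regex, but regex is "DOM-ignorant" and will quietly fail and/or mutate your document.

I generally prefer a combination of DOMDocument and XPath to target document entities concisely, directly, and intuitively.

With only a couple of minor exceptions, the XPath expression closely resembles the logic in plain English.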

//@*[not(name()="src")]

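at any level in the document (//)
find any attribute (@*)
that satisfies these requirements ([])
that is not (not())
named "src" (name()="src")

This is far more readable, attractive, and maintainable.

Code: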

$html = <<<HTML
<p id="paragraph" class="green">
    This is a paragraph with an image <img src="/path/to/image.jpg" width="50" height="75"/>
</p>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
foreach ($xpath->query('//@*[not(name()="src")]') as $attr) {
    $attr->parentNode->removeAttribute($attr->nodeName);
}
echo $dom->saveHTML();

Output:

<p>
    This is a paragraph with an image <img src="/path/to/image.jpg">
</p>

If you want to add another exempt attribute, you can use or:

//@*[not(name()="src" or name()="href")]
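
To generalize this to an arbitrary list of exempt attributes, the expression can be built from an array. A minimal sketch, assuming PHP 7.4 or later for the arrow function; stripAttributesExcept() is a hypothetical helper name, and the attribute names are assumed to be plain identifiers that need no escaping inside the XPath string.

```php
<?php
// Hypothetical helper: remove every attribute whose name is not listed in $keep.
function stripAttributesExcept(string $html, array $keep): string
{
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

    // Builds e.g. name()="src" or name()="href"; an empty list matches every attribute.
    $tests = array_map(fn(string $name): string => 'name()="' . $name . '"', $keep);
    $query = $tests ? '//@*[not(' . implode(' or ', $tests) . ')]' : '//@*';

    $xpath = new DOMXPath($dom);
    foreach ($xpath->query($query) as $attr) {
        // ownerElement is the element that carries this attribute node.
        $attr->ownerElement->removeAttribute($attr->nodeName);
    }

    return trim($dom->saveHTML());
}

$result = stripAttributesExcept(
    '<p class="x"><a href="/home" title="Home"><img src="/logo.png" width="50"></a></p>',
    ['src', 'href']
);
echo $result, "\n";
```

Keeping the exempt list in one array means that adding an attribute is a one-word change rather than an edit to the XPath string.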

Regular expression: strip HTML attributes except SRC
