Regex to match HTML tags with specific attributes

Time: 2012-01-25 18:50:08

Tags: regex pattern-matching string-matching

I am trying to match all HTML tags that do not have the attribute "term" or "range".

Here is a sample of the HTML format

<span class="inline prewrap strong">DATE:</span>    12/01/10
<span class="inline prewrap strong">MR:</span>  1234567
<span class="inline prewrap strong">DOB:</span> 12/01/65
<span class="inline prewrap strong">HISTORY OF PRESENT ILLNESS:</span>  Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum

<span class="inline prewrap strong">MEDICATIONS:</span>  <span term="Advil" range="true">Advil </span>and Ibuprofen.

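My regex is: <(.*?)((?!\bterm\b).)>

Unfortunately, this matches all tags... It would be nice if the inner text were not matched, because I need to filter out all tags except the ones with that specific attribute.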
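
To make the problem concrete, here is a minimal Python sketch (Python is an assumption; the question does not name a language) that runs the pattern from the question over the last line of the sample. The negative lookahead only guards the single character just before `>`, so no tag is ever rejected, including the one that carries term.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(r'<(.*?)((?!\bterm\b).)>')

# Last line of the sample HTML from the question.
html = ('<span class="inline prewrap strong">MEDICATIONS:</span>  '
        '<span term="Advil" range="true">Advil </span>and Ibuprofen.')

# group(0) is the whole matched tag, independent of the capture groups.
matches = [m.group(0) for m in pattern.finditer(html)]
for tag in matches:
    print(tag)
```

All four tags on the line are printed, opening and closing alike.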

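5 Answers:

Answer 0: (score: 9)

If regex is your thing, this works for me. (Note: it does not filter out comments, doctype and other entities.
Other caveats: tags can be embedded in scripts, comments and other content.)
The patterns below are laid out with free spacing, so they need the extended (x) flag; (?>...) is an atomic group, which not every regex engine supports.

span tag (w/ attr) with no term|range attrs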

'<span
  (?=\s)
  (?! (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) (?:term|range) \s*= )
  \s+ (?:".*?"|\'.*?\'|[^>]*?)+ 
>'

any tag (w/ attr) with no term|range attrs

'<[A-Za-z_:][\w:.-]*
  (?=\s)
  (?! (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) (?:term|range) \s*= )
  \s+ (?:".*?"|\'.*?\'|[^>]*?)+ 
>'

any tag (w/ or w/o attr) with no term|range attrs

'<
  (?:
    [A-Za-z_:][\w:.-]*
    (?=\s)
    (?! (?:[^>"\']|(?>".*?"|\'.*?\'))*? (?<=\s) (?:term|range) \s*= )
    \s+ (?:".*?"|\'.*?\'|[^>]*?)+ 
  |
    /?[A-Za-z_:][\w:.-]*\s*/?
  )
>'

Update

An alternative that avoids the (?>) construct. The regexes below are for no-'term|range'-attributes.
Flags = (g) global and (s) dotall

span tag w/ attr
Link: http://regexr.com?2vrjr
Regex: <span(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+>

any tag w/ attr
Link: http://regexr.com?2vrju
Regex: <[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+>

any tag w/ attr or w/o attr
Link: http://regexr.com?2vrk1
Regex: <(?:[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+|/?[A-Za-z_:][\w:.-]*\s*/?)>

'Match all tags except those that have term="occasionally"'

Link: http://regexr.com?2vrka
<(?:[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)term\s*=\s*(["'])\s*occasionally\s*\1)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+|/?[A-Za-z_:][\w:.-]*\s*/?)>
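
For anyone who wants to try this outside regexr, here is a minimal Python sketch (Python is an assumption; the answer does not name a language). It copies the "any tag w/ attr or w/o attr" regex from the update verbatim and runs it with the dotall flag over the last line of the sample. The names NO_TERM_OR_RANGE, html and kept are made up for this illustration.

```python
import re

# "any tag w/ attr or w/o attr" regex from the update, copied verbatim.
# It has no atomic groups, so Python's re module accepts it as written.
NO_TERM_OR_RANGE = re.compile(
    r"""<(?:[A-Za-z_:][\w:.-]*(?=\s)(?!(?:[^>"\']|"[^"]*"|\'[^\']*\')*?(?<=\s)(?:term|range)\s*=)(?!\s*/?>)\s+(?:".*?"|\'.*?\'|[^>]*?)+|/?[A-Za-z_:][\w:.-]*\s*/?)>""",
    re.DOTALL,  # the (s) flag from the answer
)

# Last line of the sample HTML from the question.
html = ('<span class="inline prewrap strong">MEDICATIONS:</span>  '
        '<span term="Advil" range="true">Advil </span>and Ibuprofen.')

# The pattern has no capture groups, so findall returns whole tags.
kept = NO_TERM_OR_RANGE.findall(html)
print(kept)

# "term=" inside a quoted value is not mistaken for an attribute.
print(NO_TERM_OR_RANGE.findall('<a title="see term=x">'))
```

The first call keeps the class-only span and both closing tags and skips the span that has term and range. Because the last group nests quantifiers, input with an unclosed tag may backtrack heavily, so it is worth testing on realistic data first.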

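Answer 1: (score: 1)

I think you should use an HTML parser for this problem. Writing your own regex is possible, but it is bound to be error-prone. Imagine that your input contains an expression like this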

< span      class = "a"              >b< / span         >

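Spacing like this can occur too, and it is not easy to account for every possible space and TAB character in a regex; you would need to test it before you could be sure it works as expected.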
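
As a rough illustration of the parser route, here is a minimal sketch using Python's standard html.parser module (the choice of Python and of this parser is an assumption; the answer does not recommend a specific one). TagCollector and EXCLUDED are names made up for this example. It keeps the raw text of every start tag that has neither a term nor a range attribute.

```python
from html.parser import HTMLParser

EXCLUDED = {"term", "range"}  # attribute names from the question


class TagCollector(HTMLParser):
    """Collect the raw text of start tags that carry none of EXCLUDED."""

    def __init__(self):
        super().__init__()
        self.kept = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lower-cased.
        names = {name for name, _ in attrs}
        if not names & EXCLUDED:
            # get_starttag_text() returns the tag exactly as it was written.
            self.kept.append(self.get_starttag_text())


# Last line of the sample HTML from the question.
html = ('<span class="inline prewrap strong">MEDICATIONS:</span>  '
        '<span term="Advil" range="true">Advil </span>and Ibuprofen.')

collector = TagCollector()
collector.feed(html)
collector.close()
print(collector.kept)
```

The parser takes care of quoting and whitespace inside the tag, so the filter itself is just a set intersection on attribute names.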

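Answer 2: (score: 1)

This will do what you want. It is written for a Perl program, and the syntax may vary depending on the language you are using.

/(?! [^>]+ \b(?:term|range)= ) (<[a-z]+.*?>) /igx

The code below demonstrates this pattern in a Perl program

use strict;
use warnings;

# Skip any tag whose text up to the closing > contains term= or range=.
my $pattern = qr/ (?! [^>]+ \b(?:term|range)= ) (<[a-z]+.*?>) /ix;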

my $str = <<'END';

<span class="inline prewrap strong">DATE:</span>    12/01/10
<span class="inline prewrap strong">MR:</span>  1234567
<span class="inline prewrap strong">DOB:</span> 12/01/65
<span class="inline prewrap strong">HISTORY OF PRESENT ILLNESS:</span>  Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum

<span class="inline prewrap strong">MEDICATIONS:</span>  <span term="Advil" range="true">Advil </span>and Ibuprofen.

END

print "$_\n" foreach $str =~ /$pattern/g;

Output

<span class="inline prewrap strong">
<span class="inline prewrap strong">
<span class="inline prewrap strong">
<span class="inline prewrap strong">
<span class="inline prewrap strong">
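
For readers not using Perl, a rough Python port might look like the sketch below (Python is an assumption, and the attribute names term and range are taken from the question). re.X plays the role of Perl's /x and re.I of /i. The last two calls probe two limits of the simple lookahead: it looks inside quoted values, and it expects = directly after the attribute name.

```python
import re

# Port of the Perl pattern: re.X ignores the layout spaces, re.I ignores case.
pattern = re.compile(r"(?! [^>]+ \b(?:term|range)= ) (<[a-z]+.*?>)", re.I | re.X)

# Two lines from the sample HTML in the question.
sample = (
    '<span class="inline prewrap strong">DOB:</span> 12/01/65\n'
    '<span class="inline prewrap strong">MEDICATIONS:</span>  '
    '<span term="Advil" range="true">Advil </span>and Ibuprofen.\n'
)

# One capture group, so findall returns the captured tag text.
tags = pattern.findall(sample)
print(tags)

# "term=" inside a quoted value also trips the lookahead, so this tag is skipped.
quoted = pattern.findall('<a title="no term=here">')
print(quoted)

# Spaces before "=" hide the attribute from the lookahead, so this tag is kept.
spaced = pattern.findall('<a term = "x">')
print(spaced)
```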

Answer 3: (score: 0)

<\w+\s+(?!term).*?>(.*?)</.*?>
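
The answer gives no explanation, so here is a small Python check of what the pattern does (Python and the test strings are assumptions made for illustration). The lookahead sits right after the first run of whitespace, so it only inspects the first attribute, and it does not look for range at all.

```python
import re

# The pattern from this answer, unchanged.
pattern = re.compile(r'<\w+\s+(?!term).*?>(.*?)</.*?>')

cases = {
    "plain": '<span class="inline prewrap strong">DATE:</span>',
    "term_first": '<span term="Advil" range="true">Advil </span>',
    "term_second": '<span class="x" term="Advil">Advil </span>',
    "range_only": '<span range="true">Advil </span>',
}

# Map each case to the captured inner text, or None when nothing matches.
results = {}
for label, text in cases.items():
    m = pattern.search(text)
    results[label] = m.group(1) if m else None
    print(label, repr(results[label]))
```

Only the term_first case is rejected; a term attribute in second position, or a tag with range alone, still matches.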

Answer 4: (score: 0)

I think this regex will work fine.

This regex will select any HTML tag that has a style attribute.

<\s*\w*\s*style.*?>

You can check it at https://regex101.com
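
A quick way to check the pattern locally is the Python sketch below (Python and the sample strings are assumptions). It suggests the pattern matches the whole tag rather than just the attribute, and only when style comes directly after the tag name.

```python
import re

# The pattern from this answer, unchanged.
pattern = re.compile(r'<\s*\w*\s*style.*?>')

# style is the first attribute: the whole opening tag is matched.
first = pattern.findall('<p>hi</p><div style="color:red">x</div>')
print(first)

# style is the second attribute: nothing is matched.
second = pattern.findall('<div id="x" style="color:red">x</div>')
print(second)
```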