Question

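I need to match strings in which two <br> tags inside a <p> are not direct children of that <p>. In other words:

<p>hello world<br><br>goodbye world</p> - valid

<p>hello <span>world<br><br></span>goodbye world</p> - invalid, should match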

After a while, I came up with this:

<p>.*<(?:(?!br>)).*?<br><br>.*?<(?:(.*<\/p>))

This comes close to doing the job, but it fails on, for example:

abcabc abc - should be valid

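Test page: http://regexr.com/3c3k8

P.S.: This is for matching rows in a database so I can fix them by hand (I hope). Also, I'm not the one who made this decision, and I didn't get a vote.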

Answer 1

Regex is definitely not the right tool for this task. That said, you can try a regex like this.

#p tag and stuff before a tag that includes <br><br>
<p>(?:(?!<\/?p)[\s\S])*?

#capture tag that's not a p tag
<(?!p)(\w+)

  #capture tag only if it's not a singleton tag
  (?=(?:(?!<\/?p)[\s\S])*?<\/\1)[^>]*>

  #don't skip the current tag and find <br><br>
  (?:(?!<\/?(?:p|\1))[\s\S])*<br><br>

#stuff until closing p
[\s\S]*?<\/p>

In JS, use it with the i (case-insensitive) flag and without the comments.

<p>(?:(?!<\/?p)[\s\S])*?<(?!p)(\w+)(?=(?:(?!<\/?p)[\s\S])*?<\/\1)[^>]*>(?:(?!<\/?(?:p|\1))[\s\S])*<br><br>[\s\S]*?<\/p>

For details, see the explanation on regex101. Note that the pattern does some backtracking.

Here is your regexr sample.
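As a quick check of the one-line pattern above, here is a minimal JavaScript sketch that wraps it in a helper. The helper name findNestedBrParagraph and the sample strings are illustrative assumptions, not part of the original answer; the pattern itself is copied unchanged and only the i flag is added, as the answer suggests.

```javascript
// Pattern from the answer above, written as a JS literal with the i flag.
const nestedBrPattern = /<p>(?:(?!<\/?p)[\s\S])*?<(?!p)(\w+)(?=(?:(?!<\/?p)[\s\S])*?<\/\1)[^>]*>(?:(?!<\/?(?:p|\1))[\s\S])*<br><br>[\s\S]*?<\/p>/i;

// Hypothetical helper: returns the first offending paragraph, or null when
// every <br><br> found is a direct child of its <p>.
function findNestedBrParagraph(html) {
  const match = nestedBrPattern.exec(html);
  return match === null ? null : match[0];
}

// <br><br> sits inside a <span>, so the paragraph is reported.
console.log(findNestedBrParagraph("<p>hello <span>world<br><br></span>goodbye world</p>"));

// <br><br> is a direct child of <p>, so nothing is reported (null).
console.log(findNestedBrParagraph("<p>hello world<br><br>goodbye world</p>"));
```

Because the pattern backtracks, it is probably best suited to a one-off scan of database rows rather than to input of unbounded size.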

Answer 2

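I can't stress enough why regex should not be used to solve this problem. Perhaps this solution serves to show everything that is wrong with the regex approach.

two <br> that are not direct elements of <p>

in JavaScript or VB.NET

The following regex works in .NET and uses balancing groups to validate any depth of nested tags: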

<p>                                         # MAIN Opening <p>
(?>[^<]*)                                   # any text
(?>                                         # BEFORE <br><br>
    [^<]+                                   #  any text
  |                                         #  or
    <                                       #  TAGS
    (?:                                     #   Options:
        !--.*?-->                           #    1. comments
      |                                     #
        \/?\s*(?:area|base|br|col           #    2. self-closing tags
                |embed|hr|img|input         #
                |keygen|link|meta|param     #
                |source|track|wbr           #
              )\b[^>]*\/?>                  #
      |                                     #
        \s*(?<p>p\b)                        #    3. opening nested <p>
      |                                     #
        /\s*(?<-p>p\b)                      #    4. closing nested <p>
      |                                     #
        \s*(?<nestedtag>                    #    5.a) if inside a nested tag:
               (?(nestedtag)\k<nestedtag>   #         another nested tag (same tag)
             |                              #
               [-:\w]+)                     #      b) else: opening nested tag (except <p>)
           \b)                              #      *tag ends with word boundary
      |                                     #
        /\s*(?<-nestedtag>\k<nestedtag>\b)  #    6. closing nested tag
      |                                     #
        (?!/\s*p\b)                         #    7. any other tag except <p> (inside nested tag)
    )                                       #   end of Options
    [^>]*>                                  #  end of TAGS before <br><br>
)*?                                         # repeat as few as possible (BEFORE <br><br>)
(?(nestedtag)(?(p)(?!))|(?!))               # Conditions: unbalanced nested tags and balanced <p>
                                            #
(?:<br>){2}                                 # MATCH: <br><br>
                                            #
(?>[^<]*)                                   # AFTER <br><br> (any text)
(?>                                         #
    [^<]+                                   #  any text
  |                                         #  or
    <                                       #  TAGS
    (?:                                     #   Options:
        (?<p>\s*p\b)                        #    1. opening nested <p>
      |                                     #
        (?<-p>/\s*p\b)                      #    2. closing nested <p>
      |                                     #
        (?!/\s*p\b)                         #    3. any other tag (except the main </p)
    )                                       #   end of Options
    [^>]*>                                  #   rest of tag
)*                                          # repeat as much as possible (AFTER <br><br>)
(?(p)(?!))                                  # Conditions: balanced <p> tags
                                            #
</\s*p\b[^>]*>                              # MAIN Closing </p>

VB.NET code

Dim pattern As String = "<p>(?>[^<]*)(?>[^<]+|<(?:!--.*?-->|/?\s*(?:area|base|br|col|embed|hr|img|input|keygen|link|meta|param|source|track|wbr)\b[^>]*/?>|\s*(?<p>p\b)|/\s*(?<-p>p\b)|\s*(?<nestedtag>(?(nestedtag)\k<nestedtag>|[-:\w]+)\b)|/\s*(?<-nestedtag>\k<nestedtag>\b)|(?!/\s*p\b))[^>]*>)*?(?(nestedtag)(?(p)(?!))|(?!))(?:<br>){2}(?>[^<]*)(?>[^<]+|<(?:(?<p>\s*p\b)|(?<-p>/\s*p\b)|(?!/\s*p\b))[^>]*>)*(?(p)(?!))</\s*p\b[^>]*>"


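' Assumes Imports System.Text.RegularExpressions and a String variable subject holding the HTML to scan.
Dim r As Regex = New Regex(pattern, RegexOptions.IgnoreCase Or RegexOptions.Singleline)

Dim m As Match = r.Match(subject)
Dim matchCount As Integer = 0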
Do While m.Success
    matchCount += 1
    Console.WriteLine("Match " & matchCount & ": " & m.Groups(0).ToString())
    m = m.NextMatch()
Loop

.NET Fiddle

Output

Match 1: <p>hello <span>world<br><br></span>goodbye world</p>
Match 2: <p><p>xxx</p><span><br><br></span></p>
Match 3: <p><span><span>xxx</span><br><br></span></p>
Match 4: <p>asdf<span>asdf<br><br>asdf</span><br><br></p>
Match 5: <p><span>acb<br><br></span>abcd</p>
Match 6: <p>asdf<span>abc<br><br></span></p>
Match 7: <p><STRONG>Cetárea Duromar</STRONG> es una empresa familiar con más de 20 años de experiencia al
servicio de la restauración y el particular <STRONG>brindando siempre la mejor calidad en mariscos
y un esmerado servicio.<BR><BR></STRONG>Hemos sabido adaptarnos a los nuevos tiempos, incorporando
la mejor tecnología, controlando la calidad de nuestro producto, pero sobre todo exigiéndonos a
nosotros mismos ser superiores cada día para poner lo mejor de nuestro mar en su mesa.<BR><BR>Les
ofrecemos una muy <STRONG>cuidada selección del mejor marisco de la ría, de excelente calidad</STRONG>
y con una presentación extraordinaria.<BR><BR>Producto 100% garantizado.</p>
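The counting that the balancing groups perform can also be sketched without a regex engine doing the bookkeeping. The JavaScript below is a simplified illustration under stated assumptions, not a port of the .NET pattern: the function name hasNestedDoubleBr is hypothetical, the input is assumed to be the inner HTML of a single <p>, and comments, nested <p> elements, and attribute values containing > are not handled.

```javascript
// Void elements never get a closing tag, so they must not change the depth.
// The list mirrors the self-closing tags named in the regex above.
const VOID_TAGS = new Set(["area", "base", "br", "col", "embed", "hr", "img",
  "input", "keygen", "link", "meta", "param", "source", "track", "wbr"]);

// Returns true when two adjacent <br> tags occur inside at least one
// element nested in the paragraph (depth > 0), false otherwise.
function hasNestedDoubleBr(paragraphInner) {
  const tagPattern = /<(\/?)\s*([-:\w]+)[^>]*>/g;
  let depth = 0;          // number of currently open non-void tags
  let previousBrEnd = -1; // index just past the previous <br>, or -1
  let tag;
  while ((tag = tagPattern.exec(paragraphInner)) !== null) {
    const isClosing = tag[1] === "/";
    const name = tag[2].toLowerCase();
    if (name === "br") {
      // Adjacent means the previous <br> ended exactly where this one starts.
      if (previousBrEnd === tag.index && depth > 0) return true;
      previousBrEnd = tagPattern.lastIndex;
    } else if (!VOID_TAGS.has(name)) {
      depth += isClosing ? -1 : 1;
    }
  }
  return false;
}

console.log(hasNestedDoubleBr("hello <span>world<br><br></span>goodbye world"));
console.log(hasNestedDoubleBr("hello world<br><br>goodbye world"));
```

The depth counter plays the role of the nestedtag stack in the pattern above: a <br><br> pair only counts while at least one non-void tag is still open. For real-world markup, a proper HTML parser would likely be a safer choice than either approach.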

