Regex for detecting multi-line text when a line ends with a given character

Time: 2015-11-18 12:07:07

Tags: .net regex vb.net macros multiline

I have a parser that parses PAWN language code.

I already have a regex that parses the definitions in the code. A typical definition looks like this:

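#define DEFINE_NAME DEFINE_VALUE

I use the following regex to detect it:

#define[ \t]+([^\n\r\s\\;]+)(?:[ \t]*([^\s;]+))?

Now for the actual problem: PAWN only allows a definition to span multiple lines if each continued line ends with a backslash, and the definition can keep going for as long as the lines end with one.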

So I'd like a regex that can capture this kind of possibly multi-line content.

  

Note: I also need it to work on single-line definitions, so please keep that in mind.

     

I'm using .NET, so that's the regex flavor.

Any help or contribution is much appreciated. :D

1 answer:

Answer 0 (score: 1)

We can include an optional backslash followed by a newline:

(?:\\\r?\n[ \t]*)?

Then, to allow multiple lines that end with a backslash, we can repeat the following construct:

(?<value>(?>                  # Captures the DEFINE_VALUE
    [^\\\r\n;]+               #   Any char (except \ \n)
  |                           #  or
    \\[^\r\n][^\\\r\n;]*      #   "\" within value
)+)?                          #  (~unrolling the loop)
(?:\\\r?\n[ \t]*)?            # allow "\" for new line  

Code

Dim pattern As String = "^[ \t]*                    # beginning of line     " & vbCrLf &
                        "[#]define[ \t]+            # PAWN #define          " & vbCrLf &
                        "(?<name>[^\s\\;]+)         # DEFINE_NAME           " & vbCrLf &
                        "[ \t]*(?:\\\r?\n[ \t]*)?   # spaces and optional \ " & vbCrLf &
                        "(?>                        #                       " & vbCrLf &
                        "  (?<value>(?>             # DEFINE_VALUE          " & vbCrLf &
                        "    [^\\\r\n;]+   |        #  Any char -except \ \n" & vbCrLf &
                        "    \\[^\r\n][^\\\r\n;]*   #  \ within value       " & vbCrLf &
                        "  )+)?                     #  (~unrolling the loop)" & vbCrLf &
                        "  (?:\\\r?\n[ \t]*)?       # \ for new line        " & vbCrLf &
                        ")*                         # repeated for each line"

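' Requires: Imports System.Text.RegularExpressions
Dim re As Regex = New Regex(pattern, RegexOptions.Multiline Or
                                     RegexOptions.IgnorePatternWhitespace)
' Sample input: a definition continued across lines, then a single-line definition
Dim text As String =    "#define DEFINE_NAME \"     & vbCrLf &
                        "       DEFINE_VALUE\"      & vbCrLf &
                        "       CONTINUE_VALUE"     & vbCrLf &
                        "#define TheName TheValue"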
Dim mNum As Integer = 0
Dim matches As MatchCollection = re.Matches(text)

'Loop Matches
For Each match As Match In matches
    'get name
    Dim name As String = match.Groups("name").Value
    Console.WriteLine("Match #{0} - Name: {1}", mNum, name)

    'get values (in each capture)
    Dim captureCtr As Integer = 0
    For Each capture As Capture In match.Groups("value").Captures
        'loop captures for the Group "value"
        Console.WriteLine(vbTab & "Line #{0} - Value: {1}", 
                                captureCtr, capture.Value)
        captureCtr += 1              
    Next
    mNum += 1
Next

Output

Match #0 - Name: DEFINE_NAME
    Line #0 - Value: DEFINE_VALUE
    Line #1 - Value: CONTINUE_VALUE
Match #1 - Name: TheName
    Line #0 - Value: TheValue

ideone demo

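  • Note that I'm using named groups, (?<name>..) and (?<value>..). That's why they are referenced in the code as match.Groups("name").

  • Also, the group (?<value>..) is repeated once per line. Groups("value") only holds information about the last substring it captured, whereas its Captures property holds information about every substring the group captured. This is a distinctive feature of the .NET regex engine, and it is the reason I loop over match.Groups("value").Captures.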
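
To make the difference between a group's Value and its Captures concrete, here is a minimal, self-contained sketch. It is not part of the answer's parser: the pattern, the sample input, the module name CapturesDemo and the helper AllWords are made up for illustration only.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module CapturesDemo
    ' Illustrative pattern (an assumption, not the PAWN pattern):
    ' the named group "word" sits inside a repeated construct.
    Private Const DemoPattern As String = "^(?:(?<word>\w+)\s*)+$"

    ' Hypothetical helper: returns every substring captured by "word".
    Function AllWords(input As String) As List(Of String)
        Dim result As New List(Of String)
        Dim m As Match = Regex.Match(input, DemoPattern)
        For Each c As Capture In m.Groups("word").Captures
            result.Add(c.Value)
        Next
        Return result
    End Function

    Sub Main()
        Dim m As Match = Regex.Match("one two three", DemoPattern)

        ' Group.Value exposes only the last capture of the repeated group.
        Console.WriteLine(m.Groups("word").Value)                        ' three

        ' Captures keeps all of them, in the order they were matched.
        Console.WriteLine(String.Join(", ", AllWords("one two three")))  ' one, two, three
    End Sub
End Module
```

The same idea applies to the "value" group above: joining its captures (for example with String.Join) would give the full multi-line definition as a single string, if that is what the parser needs.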