VB.NET 2010: Matching Java multi-line comments with Regex

Time: 2018-07-16 16:14:58

Tags: regex vb.net-2010

I want to remove (Java / C / C++ / ..) multi-line comments from a file. For that, I wrote a regular expression:

/\*[^\*]*(\*+[^\*/][^\*]*)*\*+/

This regex works in Notepad++ and Geany (search and replace all with nothing). In VB.NET the regex behaves differently.

I'm using:

Microsoft Visual Studio 2010 (Version 10.0.40219.1 SP1Rel)
Microsoft .NET Framework (4.7.02053 SP1Rel)

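The files I'm running the replacement on are not complicated. I don't need to worry about quoted text that might contain something that looks like the start or end of a comment.

@sln Thanks for your detailed answer. I'll quickly explain my regex in the same way you did: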

/\*                      Find the beginning of the comment.
[^\*]*                   Match any chars, but not an asterisk.
                         We need to deal with finding an asterisk now:
(\*+[^\*/][^\*]*)*       This regex breaks down to:
 \*+                     Consume asterisk(s).
    [^\*/]               Match any other char that is not an asterisk or a / (would end the comment!).
          [^\*]*         Match any other chars that are not asterisks.
(               )*       Try to find more asterisks followed by other chars.

\*+/                     Match 1 to n asterisks and finish the comment with /.

Here are two snippets:

First:

text

/*
 * block comment
 *
 */ /* comment1 */ /* comment2 */

My text to keep.

/* more comments */

more text

Second:

text

/*
 * block comment
 *
 */ /* comment1 *//* comment2 */

My text to keep.

/* more comments */

more text

The only difference is

/* comment1 *//* comment2 */

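In Notepad++ and Geany, removing the found matches works perfectly in both cases. With the regex in VB.NET, it does not work for the second example. After removal, the result for the second example looks like this: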

text



more text

But it should look like this:

text



My text to keep.



more text
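
As a cross-check outside .NET, here is a minimal Python sketch (Python's `re` module, not VB.NET) that applies the same pattern to the second snippet. The names `PATTERN`, `SECOND_SNIPPET` and `strip_block_comments` are made up for illustration. Under Python's engine the pattern matches the four comments separately and `My text to keep.` survives; this says nothing about why the .NET run behaved differently.

```python
import re

# The pattern from the question, unchanged.
PATTERN = re.compile(r"/\*[^\*]*(\*+[^\*/][^\*]*)*\*+/")

# The second snippet from the question, written with explicit newlines.
SECOND_SNIPPET = (
    "text\n"
    "\n"
    "/*\n"
    " * block comment\n"
    " *\n"
    " */ /* comment1 *//* comment2 */\n"
    "\n"
    "My text to keep.\n"
    "\n"
    "/* more comments */\n"
    "\n"
    "more text\n"
)


def strip_block_comments(source: str) -> str:
    """Remove every /* ... */ comment by replacing each match with nothing."""
    return PATTERN.sub("", source)


if __name__ == "__main__":
    # List each comment the pattern finds, then show the stripped text.
    for match in PATTERN.finditer(SECOND_SNIPPET):
        print(repr(match.group(0)))
    print(strip_block_comments(SECOND_SNIPPET))
```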

I'm using System.Text.RegularExpressions:

Dim content As String = IO.File.ReadAllText(file_path_)
Dim multiline_comment_remover As Regex = New Regex("/\*[^\*]*(\*+[^\*/][^\*]*)*\*+/")
content = multiline_comment_remover.Replace(content, "")

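I would like to get the same result in VB.NET as in Notepad++ and Geany. As sln answered, my regex "should work, in a weird sort of way". The question is why VB.NET fails to process this regex as intended. That question is still open.

Since sln's answer gets my code working, I will accept it, although it does not explain why VB.NET does not like my regex. Thanks for your help! I learned a lot!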

1 answer:

Answer 0 (score: 0)

I think you can use a generic C++ comment stripper.

Basically, it is:
find globally with the regex below, replace with $2

Demo PCRE: https://regex101.com/r/UldYK5/1
Demo Python: https://regex101.com/r/avfSfB/1

    # raw:   (?m)((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|(?:\r?\n|[\S\s])[^/"'\\\s]*)
    # delimited:  /(?m)((?:(?:^[ \t]*)?(?:\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/)))?|\/\/(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|\/\*|\/\/))|(?=\r?\n))))+)|((?:"[^"\\]*(?:\\[\S\s][^"\\]*)*"|'[^'\\]*(?:\\[\S\s][^'\\]*)*'|(?:\r?\n(?:(?=(?:^[ \t]*)?(?:\/\*|\/\/))|[^\/"'\\\r\n]*))+|[^\/"'\\\r\n]+)+|[\S\s][^\/"'\\\r\n]*)/

    (?m)                             # Multi-line modifier
    (                                # (1 start), Comments
         (?:
              (?: ^ [ \t]* )?                  # <- To preserve formatting
              (?:
                   /\*                              # Start /* .. */ comment
                   [^*]* \*+
                   (?: [^/*] [^*]* \*+ )*
                   /                                # End /* .. */ comment
                   (?:                              # <- To preserve formatting
                        [ \t]* \r? \n
                        (?=
                             [ \t]*
                             (?: \r? \n | /\* | // )
                        )
                   )?
                |
                   //                               # Start // comment
                   (?:                              # Possible line-continuation
                        [^\\]
                     |  \\
                        (?: \r? \n )?
                   )*?
                   (?:                              # End // comment
                        \r? \n
                        (?=                              # <- To preserve formatting
                             [ \t]*
                             (?: \r? \n | /\* | // )
                        )
                     |  (?= \r? \n )
                   )
              )
         )+                               # Grab multiple comment blocks if need be
    )                                # (1 end)

 |                                 ## OR

    (                                # (2 start), Non - comments
         # Quotes
         # ======================
         (?:                              # Quote and Non-Comment blocks
              "
              [^"\\]*                          # Double quoted text
              (?: \\ [\S\s] [^"\\]* )*
              "
           |                                 # --------------
              '
              [^'\\]*                          # Single quoted text
              (?: \\ [\S\s] [^'\\]* )*
              '
           |                                 # --------------

              (?:                              # Qualified line breaks
                   \r? \n
                   (?:
                        (?=                              # If comment ahead just stop
                             (?: ^ [ \t]* )?
                             (?: /\* | // )
                        )
                     |                                 # or,
                        [^/"'\\\r\n]*                    # Chars which don't start a comment, string, escape,
                                                         # or line continuation (escape + newline)
                   )
              )+
           |                                 # --------------
              [^/"'\\\r\n]+                    # Chars which don't start a comment, string, escape,
                                               # or line continuation (escape + newline)

         )+                               # Grab multiple instances

      |                                 # or,
         # ======================
         # Pass through

         [\S\s]                           # Any other char
         [^/"'\\\r\n]*                    # Chars which don't start a comment, string, escape,
                                          # or line continuation (escape + newline)

    )                                # (2 end), Non - comments
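
The "find globally, replace with $2" usage can be sketched in Python, the engine used by the second demo link. This is only an illustration under assumptions: it takes the one-line "raw" form exactly as written above, and the names `STRIPPER` and `strip_comments` are hypothetical. A callback returns group 2, or an empty string when the match was a comment (group 1), which is what replacing with `$2` amounts to.

```python
import re

# The "raw" pattern from the answer, split across lines only for readability.
# Group 1 = one or more comments, group 2 = non-comment text.
STRIPPER = re.compile(
    r"""(?m)((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/"""
    r"""(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?"""
    r"""|//(?:[^\\]|\\(?:\r?\n)?)*?"""
    r"""(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)"""
    r"""|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'"""
    r"""|(?:\r?\n|[\S\s])[^/"'\\\s]*)"""
)


def strip_comments(source: str) -> str:
    """Keep group 2 (non-comments); drop matches where only group 1 took part."""
    return STRIPPER.sub(lambda m: m.group(2) or "", source)


if __name__ == "__main__":
    # A comment on its own line, followed by a blank line.
    print(repr(strip_comments("x = 1;\n/* c */\n\ny = 2;\n")))
    # Trailing comments after code on the same line.
    print(repr(strip_comments("a = 1; /* c */\nb = 2; // t\n")))
```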

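If you use a particular engine that doesn't support assertions,
then you'd have to use this one.
It won't preserve formatting, though.

Usage is the same as above.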

    # (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)


    (                                # (1 start), Comments 
         /\*                              # Start /* .. */ comment
         [^*]* \*+
         (?: [^/*] [^*]* \*+ )*
         /                                # End /* .. */ comment
      |  
         //                               # Start // comment
         (?: [^\\] | \\ \n? )*?           # Possible line-continuation
         \n                               # End // comment
    )                                # (1 end)
 |  
    (                                # (2 start), Non - comments 
         "
         (?: \\ [\S\s] | [^"\\] )*        # Double quoted text
         "
      |  '
         (?: \\ [\S\s] | [^'\\] )*        # Single quoted text
         ' 
      |  [\S\s]                           # Any other char
         [^/"'\\]*                        # Chars which don't start a comment, string, escape,
                                          # or line continuation (escape + newline)
    )                                # (2 end)
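
A minimal Python sketch of this assertion-free variant, again replacing each match with group 2. The names `STRIPPER`, `strip_comments` and `SAMPLE` are hypothetical, and the sample line is an assumption chosen to show two things: comment markers inside a string literal are left alone, and a `//` comment is removed together with its line break, so the next line joins the current one.

```python
import re

# The assertion-free pattern from the answer, unchanged.
# Group 1 = a comment, group 2 = anything that is not a comment.
STRIPPER = re.compile(
    r"""(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)"""
    r"""|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)"""
)

# Hypothetical input: a block comment, a string that looks like a comment,
# and a trailing line comment.
SAMPLE = 'int a = 1; /* c */ char *s = "/* not */"; // tail\nint b;\n'


def strip_comments(source: str) -> str:
    """Keep group 2 (non-comments); comments (group 1) become empty text."""
    return STRIPPER.sub(lambda m: m.group(2) or "", source)


if __name__ == "__main__":
    print(repr(strip_comments(SAMPLE)))
```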