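How to strip the comment from a line containing multiple strings and comment symbols

Date: 2014-02-21 20:32:17

Tags: java regex

I want to parse KConf files, which contain single-line comments introduced by the # character. An example of such a file can be found below.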

https://github.com/torvalds/linux/blob/master/arch/x86/Kconfig

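I know the single-line test string looks almost random, but it should contain most (if not all) variants of nested hashes and strings, as well as quotes inside the comment that do not introduce a string.

The regex engine I am currently using is the one in Groovy, which is based on Java's.

Test string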

Lorem "ipsum # \" dolor" sit amet, 'consectetur # \' adipiscing' elit. Maecenas 'suscipit #mollis' quam, non #bibendum 'elit # eleifend "in. Duis # convallis" luctus nunc, ac luctus lectus dapibus at.

Expected result

Lorem "ipsum # \" dolor" sit amet, 'consectetur # \' adipiscing' elit. Maecenas 'suscipit #mollis' quam, non

(with leading whitespace)

#bibendum 'elit # eleifend "in. Duis # convallis" luctus nunc, ac luctus lectus dapibus at.

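3 answers:

Answer 0 (score: 1)

First, I have escaped your string so that it can be stored in a JavaScript variable (since you don't seem to have stated a language, I'll assume JS):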

var str = 'Lorem "ipsum # " dolor" sit amet, \'consectetur # \' adipiscing\' elit. Maecenas \'suscipit#mollis\' quam, non #bibendum \'elit # eleifend "in. Duis # convallis" luctus nunc, ac luctus lectus dapibus at.';

To remove everything from a " " followed by a "#" onward, where the "#" is not followed by a space:

str.replace(/ #[^ ].*/, '');

Finally, your second expected result makes no sense at all.

All of this is, of course, subject to a proper description of the problem.
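To see how this substitution behaves on the question's own test line, here is a minimal Java sketch of the same replacement (the question is tagged java). The class name `NaiveHashStrip` and its `SAMPLE` constant are made up for illustration, and `replaceFirst` stands in for the non-global JavaScript `replace`. The pattern does not look at quoting, so the result depends on whether a quoted string contains a space followed by a hash.

```java
final class NaiveHashStrip {
    // The test line from the question, written as a Java string literal.
    static final String SAMPLE =
        "Lorem \"ipsum # \\\" dolor\" sit amet, 'consectetur # \\' adipiscing' elit. "
        + "Maecenas 'suscipit #mollis' quam, non #bibendum 'elit # eleifend "
        + "\"in. Duis # convallis\" luctus nunc, ac luctus lectus dapibus at.";

    // Java rendering of str.replace(/ #[^ ].*/, ''): drop everything from the
    // first " #" that is not followed by a space, up to the end of the line.
    static String strip(String line) {
        return line.replaceFirst(" #[^ ].*", "");
    }

    public static void main(String[] args) {
        // Question's line: the cut happens at " #mollis", inside the quoted string.
        System.out.println(strip(SAMPLE));
        // Variant without the space before "#mollis": the cut happens at " #bibendum".
        System.out.println(strip(SAMPLE.replace("'suscipit #mollis'", "'suscipit#mollis'")));
    }
}
```

With the question's line the output ends in `Maecenas 'suscipit`, while the variant without the space ends in `quam, non`. A quote-aware pattern is needed for the general case.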

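Answer 1 (score: 0)

Based on the limited information, this regex might work, though trying to distinguish embedded hashes from comments seems a bit complicated. I haven't had time to test it; I pieced it together from a few other regexes. Note that it should be used in multi-line mode, and that everything is geared toward line-by-line parsing, i.e. nothing in the regex spans lines.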

 #  (?-s)^(?:"[^"\\\n]*(?:\\.[^"\\\n]*)*"|'[^'\\\n]*(?:\\.[^'\\\n]*)*'|[^#"'\s]+|(?<=[^\s#])\#+|[^\S\n]+(?!\#))*(?:[^\S\n]+|^)(\#.*)$
 #  "(?-s)^(?:\"[^\"\\\\\\n]*(?:\\\\.[^\"\\\\\\n]*)*\"|'[^'\\\\\\n]*(?:\\\\.[^'\\\\\\n]*)*'|[^#\"'\\s]+|(?<=[^\\s#])\\#+|[^\\S\\n]+(?!\\#))*(?:[^\\S\\n]+|^)(\\#.*)$"

 (?-s)                   # Modifier, No dot all 
 ^                       # Beginning of line
 (?:
      "                       # Double quotes
      [^"\\\n]* 
      (?: \\ . [^"\\\n]* )*
      "
   |                        # or
      '                       # Single quotes
      [^'\\\n]* 
      (?: \\ . [^'\\\n]* )*
      '
   |                        # or
      [^#"'\s]+               # Not hash, quotes, whitespace
   |                        # or
      (?<= [^\s#] )           # Preceded by a character, but not hash or whitespace
      \#+                     # Embedded hashes
   |                        # or
      [^\S\n]+                # Whitespaces (non-newline)
      (?! \# )                # Not followed by hash
 )*
 (?: [^\S\n]+ | ^ )      # Whitespaces  (non-newline) or BOL
 ( \# .* )               # (1), hash comment
 $                       # End of line
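Since the answer says the pattern was not tested, the following is a small Java sketch that runs it against the question's test line. The class `HashCommentFinder` and the helpers `commentOf` and `codeOf` are hypothetical names introduced here; the pattern itself is the string-escaped form given above, compiled with `Pattern.MULTILINE` as the answer requests.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HashCommentFinder {
    // The answer's pattern in Java string form, split only for readability.
    // MULTILINE makes ^ and $ match at the start and end of every line.
    static final Pattern COMMENT = Pattern.compile(
        "(?-s)^(?:\"[^\"\\\\\\n]*(?:\\\\.[^\"\\\\\\n]*)*\""
        + "|'[^'\\\\\\n]*(?:\\\\.[^'\\\\\\n]*)*'"
        + "|[^#\"'\\s]+|(?<=[^\\s#])\\#+|[^\\S\\n]+(?!\\#))*"
        + "(?:[^\\S\\n]+|^)(\\#.*)$",
        Pattern.MULTILINE);

    // The test line from the question, written as a Java string literal.
    static final String SAMPLE =
        "Lorem \"ipsum # \\\" dolor\" sit amet, 'consectetur # \\' adipiscing' elit. "
        + "Maecenas 'suscipit #mollis' quam, non #bibendum 'elit # eleifend "
        + "\"in. Duis # convallis\" luctus nunc, ac luctus lectus dapibus at.";

    // Returns the comment captured in group 1, or null if none is found.
    static String commentOf(String line) {
        Matcher m = COMMENT.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // Returns the text before the comment; whitespace before '#' is kept.
    static String codeOf(String line) {
        Matcher m = COMMENT.matcher(line);
        return m.find() ? line.substring(0, m.start(1)) : line;
    }

    public static void main(String[] args) {
        System.out.println(codeOf(SAMPLE));
        System.out.println(commentOf(SAMPLE));
    }
}
```

On the sample line, group 1 starts at `#bibendum` and the remaining text ends in `quam, non` followed by one space. One caveat, stated as an expectation rather than a measurement: the pattern nests `[^#"'\s]+` inside a repeated group, so a long line that contains no comment may cause heavy backtracking before the match fails.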

Answer 2 (score: 0)

Raw regex:

^((?:\\.|("|')(?:(?!\2|\\|[\r\n]).|\\.)*\2|[^#'"\r\n])+)#.+

Replace with $1

Example:

String re = "^((?:\\\\.|(\"|')(?:(?!\\2|\\\\|[\\r\\n]).|\\\\.)*\\2|[^#'\"\\r\\n])+)#.+";
String line = "Lorem \"ipsum # \\\" dolor\" sit amet, 'consectetur # \\' adipiscing' elit. Maecenas 'suscipit #mollis' quam, non #bibendum 'elit # eleifend \"in. Duis # convallis\" luctus nunc, ac luctus lectus dapibus at.";
String uncommented = line.replaceAll(re, "$1");

//=> Lorem "ipsum # \" dolor" sit amet, 'consectetur # \' adipiscing' elit. Maecenas 'suscipit #mollis' quam, non

regex101 demo

ideone demo

Breakdown:

^                         # Beginning of line
  (                       # Beginning of 1st capture group
    (?:                   # Non-capture group 1
      \\.                 # Match an escaped character
    |
      ("|')               # Or, a quote (and capture it in 2nd capture group),
      (?:                 # Non-capture group 2
        (?!\2|\\|[\r\n]). # Followed by any character except relevant quote, \ or newline
      |
        \\.               # Or an escaped character
      )*                  # Close of non-capture group 2 and repeat as many times
      \2                  # Close the quoted part
    |
      [^#'"\r\n]          # Any non-hash, single/double quote, newline characters
    )+                    # Close of non-capture group 1 and repeat as many times
  )                       # Close capture group 1
  #.+                     # Match comments
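The example in this answer handles a single line. For a whole file, one option is to prepend `(?m)` so that `^` matches at every line start. The sketch below assumes that change; the class name `InlineCommentStripper` and the Kconfig-like input lines are invented for illustration and are not taken from the linked file.

```java
final class InlineCommentStripper {
    // The answer's pattern with (?m) prepended so that ^ matches per line.
    static final String RE =
        "(?m)^((?:\\\\.|(\"|')(?:(?!\\2|\\\\|[\\r\\n]).|\\\\.)*\\2|[^#'\"\\r\\n])+)#.+";

    // Replaces each matching line with the part captured before the comment.
    static String strip(String text) {
        return text.replaceAll(RE, "$1");
    }

    public static void main(String[] args) {
        // Made-up input: a trailing comment, a plain line, a full-line
        // comment, and a hash inside a quoted string.
        String text = "config X86 # arch\n"
            + "\tdef_bool y\n"
            + "# full-line comment\n"
            + "\tprompt \"a # b\" # trailing";
        System.out.println(strip(text));
    }
}
```

Two properties of the pattern as written show up here: whitespace before the removed comment is kept, and a line that begins with `#` is left unchanged, because capture group 1 requires at least one character before the hash. Full-line comments would need separate handling.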