Question

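I'm using regular expressions in Ruby to strip comments from code files. The code is C++ (though I don't think that's relevant), and the files contain something like: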

/*
    Hello! I'm a comment!
*/

int main(int argc, char* argv[])
{
    Foo foo;
    foo.bar();
    return 0;
}

My goal is to remove the comments from the code while also parsing them. Right now I do this by capturing first and then deleting:

text.scan(UGLY_COMMENTS_REGEX).each do |m|
 m.method_for_printing_matched_comment
end 
text = text.gsub(UGLY_COMMENTS_REGEX,'');

Another option I came across is to run a gsub for each individual match, instead of running it once with the full regex, for example:

text.scan(UGLY_COMMENTS_REGEX).each do |m|
 m.method_for_printing_matched_comment
 text = text.gsub(m,'');
end

The problem with this (also suboptimal) alternative is that it isn't straightforward when the match contains groups, e.g. m[0], m[1], ...
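To make the group problem concrete, here is a minimal sketch. The pattern COMMENT_WITH_GROUP and the sample string are assumptions made for illustration; the real UGLY_COMMENTS_REGEX is not shown in the question.

```ruby
# Hypothetical pattern with one capture group around the comment body.
COMMENT_WITH_GROUP = %r{/\*(.*?)\*/}m

source = "/* first */ int x; /* second */"

# With a group, scan yields arrays of captures, not the full matched text.
captures = source.scan(COMMENT_WITH_GROUP)
# captures == [[" first "], [" second "]]

# Without groups, scan yields the full matches as plain strings.
full_matches = source.scan(%r{/\*.*?\*/}m)
# full_matches == ["/* first */", "/* second */"]
```

In the grouped case each m is an array of captures, so it no longer holds the exact text that would have to be handed to gsub for deletion.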

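Since doing it this way is very inefficient, I'd like to know whether there is a way to perform the match only once (for both capturing and deleting).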

Answer 1

String#gsub! (and likewise String#gsub, String#sub! and String#sub) accepts an optional block, which is called with each matched string. So you can do this:

text.gsub!(UGLY_COMMENTS_REGEX) { |m|
  puts m # to print the matched comment  / OR  m.method_for_printing_matched_comment
  ''     # Return value is used as a replacement string; effectively remove the comment
}
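If the regex contains groups, the block form can still reach them through Regexp.last_match, which is set for each match while the block runs. The sketch below assumes a simple stand-in pattern, BLOCK_COMMENT, that only handles /* ... */ comments; it is not the asker's UGLY_COMMENTS_REGEX, and it would also match comment-like text inside string literals.

```ruby
# Hypothetical stand-in for UGLY_COMMENTS_REGEX: /* ... */ block comments only.
BLOCK_COMMENT = %r{/\*(.*?)\*/}m

text = +"/* Hello! */\nint main() { /* body */ return 0; }\n"
comments = []

# One pass: capture each comment body and delete the comment.
text.gsub!(BLOCK_COMMENT) do
  comments << Regexp.last_match(1).strip  # group 1 is the comment body
  ''                                      # block value replaces the match
end

# text     == "\nint main() {  return 0; }\n"
# comments == ["Hello!", "body"]
```

Collecting into an array instead of printing keeps the captured comments available for later parsing.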

Answer 2

I believe the following should work.

Code

def strip_comments(str)
  comments = []
  [str.split(/[ \t]*\/\*|\*\/(?:[ \t]*\n)?/)
      .select.with_index {|ar,i| i.even? ? true : (comments << ar.strip; false)}
      .join,
   comments]
end

Example

str =<<_
/*
    Hello! I'm a comment!
*/

int main(int argc, char* argv[])
{
    Foo foo;
    /* Let's get this one too */
    foo.bar();
    return 0;
}
_

cleaned_code, comments = strip_comments(str)

puts cleaned_code
  # int main(int argc, char* argv[])
  # {
  #     Foo foo;
  #     foo.bar();
  #     return 0;
  # }

puts comments
  # Hello! I'm a comment!
  # Let's get this one too

Explanation

For the example above:

comments = []

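Splitting the string on /* or */ creates an array in which every other element is the text of a comment. The first element of the array is text to keep; it equals "" if the string begins with a comment. To preserve the formatting (I hope), I also strip any spaces or tabs (but not newlines) before /*, and a newline, or spaces or tabs followed by a newline, after */.

b = str.split(/[ \t]*\/\*|\*\/(?:[ \t]*\n)?/)
  #=> ["",
  #    "\n    Hello! I'm a comment!\n",
  #    "\nint main(int argc, char* argv[])\n{\n    Foo foo;\n",
  #    " Let's get this one too ",
  #    "    foo.bar();\n    return 0;\n}\n"]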

We want to select the elements that are not comments, while keeping the comments aside:

enum0 = b.select
  #=> #<Enumerator: [
  #     "",
  #     "\n    Hello! I'm a comment!\n",
  #     "\nint main(int argc, char* argv[])\n{\n    Foo foo;\n",
  #     " Let's get this one too ",
  #     "    foo.bar();\n    return 0;\n}\n"]:select>

Add the index so we can tell which elements are comments:

enum1 = enum0.with_index
  #=> #<Enumerator: #<Enumerator: [
  #     "",
  #     "\n    Hello! I'm a comment!\n",
  #     "\nint main(int argc, char* argv[])\n{\n    Foo foo;\n",
  #     " Let's get this one too ",
  #     "    foo.bar();\n    return 0;\n}\n"]:select>:with_index>

You can think of enum1 as a "compound enumerator". To see the elements it will pass to its block, convert it to an array:

enum1.to_a
  #=> [["", 0],
  #    ["\n    Hello! I'm a comment!\n", 1],
  #    ["\nint main(int argc, char* argv[])\n{\n    Foo foo;\n", 2],
  #    [" Let's get this one too ", 3],
  #    ["    foo.bar();\n    return 0;\n}\n", 4]]

Use Enumerator#each to run the enumerator with its block:

c = enum1.each {|ar,i| i.even? ? true : (comments << ar.strip; false)}
  #=> ["",
  #    "\nint main(int argc, char* argv[])\n{\n    Foo foo;\n",
  #    "    foo.bar();\n    return 0;\n}\n"]

Confirm that comments was built correctly:

puts comments
  # Hello! I'm a comment!
  # Let's get this one too

Join the elements of c:

cleaned_text = c.join
  #=> "\nint main(int argc, char* argv[])\n{\n    Foo foo;\n    foo.bar();\n    return 0;\n}\n"

and return:

[cleaned_text, comments]

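as shown above.

Edit: slightly better, I think:

def strip_comments(str)
  a = str.split(/[ \t]*\/\*|\*\/(?:[ \t]*\n)?/)
  # Pad so that code and comment chunks pair up evenly.
  a << "" if a.size.odd?
  # Array has no each_pair; each_slice(2) groups [code, comment] pairs.
  cleaned, comments = a.each_slice(2).to_a.transpose
  [cleaned.join, comments.map(&:strip)]
end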
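The pairing idea behind this version can be seen in isolation with a small sketch. The chunks array below is a hypothetical, hand-written split result rather than output from the example string, and each_slice(2) is used here to group the code and comment chunks.

```ruby
# Hypothetical split result: code and comment chunks alternate,
# and the final code chunk has no comment after it.
chunks = ["int a;\n", " first ", "int b;\n", " second ", "int c;\n"]

# Pad so every code chunk has a partner; transpose needs equal-length rows.
chunks << "" if chunks.size.odd?

pairs = chunks.each_slice(2).to_a
# pairs == [["int a;\n", " first "], ["int b;\n", " second "], ["int c;\n", ""]]

# Turn rows of [code, comment] into one array of code and one of comments.
cleaned, comments = pairs.transpose
# cleaned  == ["int a;\n", "int b;\n", "int c;\n"]
# comments == [" first ", " second ", ""]
```

Without the padding, the last row would have one element and Array#transpose would raise an IndexError. The padding entry does show up as a trailing empty string among the comments, which can be dropped with reject(&:empty?) if it is unwanted.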

Ruby Regex: an efficient way to capture and delete at the same time
