Question

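Starting from the code example written by Parslet's own creator (available at this link), I need to extend it so that it retrieves all uncommented text from a file written in a C-like syntax.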

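The provided example successfully parses C-style comments, treating those regions as ordinary whitespace. However, this simple example only expects 'a' characters in the non-comment regions of the file, as in this sample input: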

         a
      // line comment
      a a a // line comment
      a /* inline comment */ a 
      /* multiline
      comment */

The rule used to detect the uncommented text is simply:

   rule(:expression) { (str('a').as(:a) >> spaces).as(:exp) }

So what I need is to generalize the previous rule so that it captures all other (uncommented) text from a more general file, such as:

     word0
  // line comment
   word1 // line comment
  phrase /* inline comment */ something 
  /* multiline
  comment */

I am new to Parsing Expression Grammars, and none of my previous attempts have succeeded.

Answer 1

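The general idea is that everything is code (that is, non-comment) until one of the sequences // or /* appears. You can express this with the following rule: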

rule(:code) {
  (str('/*').absent? >> str('//').absent? >> any).repeat(1).as(:code)
}

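As I mentioned in the comments, there is a small problem with strings. When a comment marker occurs inside a string, it is obviously part of the string. If you removed such "comments" from the code, you would change the meaning of that code. So we have to let the parser know what a string is, and that every character inside it belongs to it. Another matter is escape sequences. For example, the string "foo \" bar /*baz*/", which contains a literal double quote, would actually be parsed as "foo \" followed by more code. This, of course, needs to be addressed. I have written a complete parser that handles all of the cases above: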

require 'parslet'

class CommentParser < Parslet::Parser
  rule(:eof) { 
    any.absent? 
  }

  rule(:block_comment_text) {
    (str('*/').absent? >> any).repeat.as(:comment)
  }

  rule(:block_comment) {
    str('/*') >> block_comment_text >> str('*/')
  }

  rule(:line_comment_text) {
    (str("\n").absent? >> any).repeat.as(:comment)
  }

  rule(:line_comment) {
    str('//') >> line_comment_text >> (str("\n").present? | eof)
  }

  rule(:string_text) {
    (str('"').absent? >> str('\\').maybe >> any).repeat
  }

  rule(:string) {
    str('"') >> string_text >> str('"')
  }

  rule(:code_without_strings) {
    (str('"').absent? >> str('/*').absent? >> str('//').absent? >> any).repeat(1)
  }

  rule(:code) {
    (code_without_strings | string).repeat(1).as(:code)
  }

  rule(:code_with_comments) {
    (code | block_comment | line_comment).repeat
  }

  root(:code_with_comments)
end

It will parse your input

     word0
  // line comment
   word1 // line comment
  phrase /* inline comment */ something 
  /* multiline
  comment */

into this AST

[{:code=>"\n   word0\n "@0},
 {:comment=>" line comment"@13},
 {:code=>"\n  word1 "@26},
 {:comment=>" line comment"@37},
 {:code=>"\n phrase "@50},
 {:comment=>" inline comment "@61},
 {:code=>" something \n "@79},
 {:comment=>" multiline\n comment "@94},
 {:code=>"\n"@116}]

To extract everything except the comments, you can do:

input = <<-CODE
     word0
  // line comment
   word1 // line comment
  phrase /* inline comment */ something 
  /* multiline
  comment */
CODE

ast = CommentParser.new.parse(input)
puts ast.map{|node| node[:code] }.join

which produces

   word0

  word1
 phrase  something

Answer 2

Another way to handle comments is to treat them as whitespace. For example:

rule(:space?) do
  space.maybe
end

rule(:space) do
  (block_comment | line_comment | whitespace).repeat(1)
end

rule(:whitespace) do
  match('\s')
end

rule(:block_comment) do
  str('/*') >>
  (str('*/').absent? >> any).repeat(0) >>
  str('*/')
end

rule(:line_comment) do
  # The comment runs to the end of the line, or to the end of the input.
  str('//') >> match('[^\n]').repeat >> (str("\n") | any.absent?)
end

Then, whenever you write a rule that involves whitespace, such as this completely off-the-cuff and probably incorrect rule for C,

rule(:assignment_statement) do
  lvalue >> space? >> str('=') >> space? >> rvalue >> str(';')
end

comments get "eaten" by the parser without any fuss. Wherever whitespace may or must appear, any kind of comment is allowed and is treated as whitespace.

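This approach is not well suited to your exact problem, which is to identify the non-comment text in a C program, but it works very well in a parser that has to recognize the full language.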

How do I handle C-style comments in Ruby using Parslet?
