Question

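I'm looking for a command-line tool, or a library for C, C++, Python, or Node.js, that extracts only the comments from source files written in a variety of languages.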

For example, given "bob.c":

int main(){ //Here is a comment
  int i=3;  /*Another comment*/
}

it should return the following:

Here is a comment
Another comment

possibly with line numbers.

This should work for "bob.py", "bob.js", "bob.css", "bob.rb", "bob.asm", and so on.

This question differs from this other one because I'm interested not only in C-style comments but in other kinds of comments as well.

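Also, I'm very skeptical of regular expressions as a solution. Comment-like phrases can be buried inside quoted text in deep ways; I have not yet seen a regex solution for this on SO.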

Answer 1

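You can use a table of regular expressions with anything that supports them: Python, C++, grep, and so on. Note that many languages have more than one comment type, and some comment types (in some languages) can span multiple lines. Returning line numbers is easy.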

Take the Python re library documentation as a starting point.
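As a rough sketch of the table idea (not the answerer's code): each entry pairs a file extension with a pattern that also matches quoted strings, so comment markers inside quotes are consumed as strings rather than reported. The names STRING, PATTERNS, and regex_comments are made up for illustration, the table covers only two extensions, and constructs such as Python triple-quoted strings or Ruby heredocs are not handled.

```python
import re

# One-line single- or double-quoted literal, honoring backslash escapes.
STRING = r'"(?:\\.|[^"\\\n])*"|\'(?:\\.|[^\'\\\n])*\''

# Hypothetical table: file extension -> pattern. Strings are matched first
# so that "//" or "#" inside quotes never reaches the comment group.
PATTERNS = {
    ".c": re.compile(STRING + r'|(?P<comment>//[^\n]*|/\*.*?\*/)', re.DOTALL),
    ".py": re.compile(STRING + r'|(?P<comment>#[^\n]*)', re.DOTALL),
}


def regex_comments(source, extension):
    """Return a list of (line_number, comment_text) pairs."""
    found = []
    for match in PATTERNS[extension].finditer(source):
        text = match.group("comment")
        if text is not None:  # None means the match was a string literal
            line = source.count("\n", 0, match.start()) + 1
            found.append((line, text))
    return found


if __name__ == "__main__":
    demo = 'int main(){ //Here is a comment\n  puts("//not one");\n}\n'
    for line, text in regex_comments(demo, ".c"):
        print("%d: %s" % (line, text))
```

This only goes as far as the patterns do; every additional quoting rule a language has needs its own alternative in the table.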

Answer 2

Following Ira Baxter's helpful suggestion, I searched for lexers and tracked down Pygments.

Pygments understands a massive number of languages and converts input in any of them into standardized HTML output suitable for highlighting.

For example, given the following C++ code:

/*Testing
a multi-line
comment*/
int main(){ //And a comment here
  int i=0; //And a comment here as well
  printf("//I am not a comment.");
}

Pygments returns:

<div class="highlight"><pre><span class="cm">/*Testing</span>
<span class="cm">a multi-line</span>
<span class="cm">comment*/</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">(){</span> <span class="c1">//And a comment here</span>
  <span class="kt">int</span> <span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="c1">//And a comment here as well</span>
  <span class="n">printf</span><span class="p">(</span><span class="s">&quot;//I am not a comment.&quot;</span><span class="p">);</span>
<span class="p">}</span>
</pre></div>

Note that comments are identified by span classes such as c1 and cm.

This reduces the problem of parsing many languages to the problem of understanding Pygments' output.

For that I bring in lxml, which can parse XML and HTML and return each node's text and line number. That does what I need.

The following code demonstrates the setup:

import lxml.etree
import pygments
import pygments.lexers
import pygments.formatters

cppcode="""
/*Testing
a multi-line
comment*/
int main(){ //And a comment here
  int i=0; //And a comment here as well
  printf("//I am not a comment.");
}
"""

#Use Pygments to convert arbitrary source code into well-defined XML
hcode = pygments.highlight(cppcode, pygments.lexers.get_lexer_by_name("c++"), pygments.formatters.HtmlFormatter())

#Use LXML to parse the well-defined XML
q = lxml.etree.fromstring(hcode)

#Use *gasp* regular expressions to match comment tags
regexpNS = "http://exslt.org/regular-expressions"
r = q.xpath("//span[re:test(@class, '^c')]", namespaces={'re': regexpNS})

#Print line number of comment and the comment's text
for i in r:
  print("%d: %s" % (i.sourceline, i.text))

Note that Pygments emits several different comment span classes. You can list them with:

print(pygments.formatters.HtmlFormatter().get_style_defs())

We see the following groups:

.cm { color: #408080; font-style: italic } /* Comment.Multiline */
.cp { color: #BC7A00 } /* Comment.Preproc */
.c1 { color: #408080; font-style: italic } /* Comment.Single */
.cs { color: #408080; font-style: italic } /* Comment.Special */

Fortunately, no non-comment style classes begin with c.
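The same information can probably be read straight from the Pygments token stream, without going through HTML and lxml. The following is a minimal sketch, not the code used above: pygments_comments is a made-up helper name, and the line numbers are computed here by counting newlines in the tokens. Note that Comment.Preproc tokens also count as comments under this test, just as the cp class matches the '^c' pattern above.

```python
import pygments
from pygments.lexers import get_lexer_by_name
from pygments.token import Comment


def pygments_comments(code, language):
    """Return (line_number, comment_text) pairs for any Pygments language alias."""
    # stripnl=False keeps leading blank lines so the line count stays accurate.
    lexer = get_lexer_by_name(language, stripnl=False)
    line = 1
    found = []
    for token_type, value in pygments.lex(code, lexer):
        # Sub-types such as Comment.Single and Comment.Multiline are "in" Comment.
        if token_type in Comment:
            found.append((line, value.rstrip("\n")))
        # Advance the counter by the newlines this token covered.
        line += value.count("\n")
    return found


if __name__ == "__main__":
    cppcode = (
        "/*Testing\n"
        "a multi-line\n"
        "comment*/\n"
        "int main(){ //And a comment here\n"
        '  printf("//I am not a comment.");\n'
        "}\n"
    )
    for line, text in pygments_comments(cppcode, "c++"):
        print("%d: %s" % (line, text))
```

A multi-line comment is reported once, at the line where it starts, rather than once per HTML line.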

Answer 3

[The OP asked that this be posted as an answer]

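If you want to handle a wide variety of languages, you need to decide which categories they fall into (C-like with C-style comments, COBOL with COBOL-style comments, ...) and build a lexer for each category. If a language has a lot of odd lexical syntax (PHP is notable in this respect; look at its interpolated strings), the details of these lexers can get tricky.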
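To give a feel for what one such per-category lexer involves, here is a small hand-written scanner for the C-like category. It is only an illustrative sketch under simplifying assumptions (c_style_comments is a made-up name; it ignores trigraphs, line continuations, raw strings, and similar corner cases), not a production lexer.

```python
def c_style_comments(source):
    """Yield (line_number, text) for // and /* */ comments, skipping quoted text."""
    i, line, n = 0, 1, len(source)
    while i < n:
        ch = source[i]
        if ch in "\"'":
            # Skip a string or character literal, honoring backslash escapes.
            quote = ch
            i += 1
            while i < n and source[i] != quote:
                if source[i] == "\\":
                    i += 1  # step over the backslash to the escaped character
                if i < n and source[i] == "\n":
                    line += 1
                i += 1
            i += 1  # step past the closing quote
        elif source.startswith("//", i):
            # Line comment: runs to the end of the line (or of the input).
            end = source.find("\n", i)
            end = n if end == -1 else end
            yield line, source[i:end]
            i = end
        elif source.startswith("/*", i):
            # Block comment: runs to the next "*/" (or to the end of the input).
            end = source.find("*/", i + 2)
            end = n if end == -1 else end + 2
            text = source[i:end]
            yield line, text
            line += text.count("\n")
            i = end
        else:
            if ch == "\n":
                line += 1
            i += 1


if __name__ == "__main__":
    demo = 'int main(){ //Here is a comment\n  int i=3;  /*Another comment*/\n}\n'
    for line, text in c_style_comments(demo):
        print("%d: %s" % (line, text))
```

Each further category (COBOL-style, shell-style, and so on) would need its own scanner along these lines, which is where the effort adds up.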

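If you want one off the shelf, our Source Code Search Engine provides large-scale search by lexing and indexing the code base you give it. It has lexers for some 40+ languages and dialects, and asking it to find all comments (or any other kind of token) and export them all as search hits to a hit log file is trivial. (With logging turned on, the command is just the letter "C" [for comment].)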

[In answer to another question]: it has both a GUI and a command-line interface.
