Question

I'm currently using this snippet in a Python script to detect Javadoc comments:

import re

# This regular expression matches Javadoc comments.
pattern = r'/\*\*(?:[^*]|\*(?!/))*\*/'
# Here's how it works:
# /\*\*    matches leading '/**' ('*' is a metacharacter, so it must be escaped)
# (?:      starts a non-capturing group to match one comment character
#  [^*]    matches any non-asterisk characters...
#  |       or...
#  \*      any asterisk...
#   (?!/)  that's not followed by a slash (negative lookahead)
# )        end non-capturing group
# *        matches any number of these non-terminal characters
# \*/      matches the closing '*/' (again, have to escape '*')
comments = re.findall(pattern, large_string_of_java_code)

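This regular expression isn't perfect. I'm fine with it not matching Unicode escape sequences (for example, the comment /** a */ can be written as \u002f** a */). The main problem I have is that it produces false positives on comments like this: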

// line comment /** not actually a javadoc comment */

and it may break on comments like this:

// line comment /** unfinished "Javadoc comment"
// regex engine is still searching for closing slash

I tried using a negative lookbehind for ^.*//, but according to the Python docs,

...the contained pattern must only match strings of some fixed length.

so that doesn't work.
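
For reference, here is a minimal reproduction of that limitation. It assumes the lookbehind was written as (?<!^.*//); the name error_message is only for illustration.

```python
import re

# Assumed form of the attempted pattern: '^.*//' can match text of any
# length, so the re module refuses to compile it inside a lookbehind.
try:
    re.compile(r'(?<!^.*//)/\*\*')
    error_message = None
except re.error as exc:
    error_message = str(exc)

print(error_message)
```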

I also tried matching from the beginning of the line, like this:

pattern = r'^(?:[^/]|/(?!/))*(the whole regex above)'

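but I couldn't get it to work.

Are regular expressions a good fit for this task? How can I make this work?

If regular expressions aren't the right tool, I'm happy to use any lightweight built-in Python 2 module.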

Answer 1

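If you need precision and you're working with Java code, you're probably better off integrating with javadoc (or doxygen). Maybe this will help: How to extract JavaDoc comments from the source files

If you don't need precision, you should be able to get regular expressions to handle most cases by working in stages: first eliminate the parts that cause confusion (// comments and non-Javadoc /* */ comments), then look for Javadoc comments. But you also have to decide how to handle Javadoc delimiters that happen to be embedded in strings... at that point the problem is really one of lexical analysis. Maybe that's good enough for your application?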
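
One way to sketch the staged idea in a single pass is to let one regex consume every construct that can hide a /** (line comments, ordinary block comments, string and character literals) and keep only the Javadoc matches. This is a rough illustration, not a full Java lexer: the names TOKEN and find_javadoc are made up for the example, and it assumes no Unicode escapes and no Java text blocks (""").

```python
import re

# One alternative per lexical element. Whichever construct starts first is
# consumed whole, so a '/**' inside a line comment, a plain block comment,
# or a literal is never seen as the start of a Javadoc comment.
TOKEN = re.compile(r'''
    (?P<javadoc>/\*\*(?!/)(?:[^*]|\*(?!/))*\*/)   # /** ... */ (but not /**/)
  | /\*(?:[^*]|\*(?!/))*\*/                       # other block comments
  | //[^\n]*                                      # line comments
  | "(?:\\.|[^"\\\n])*"                           # string literals
  | '(?:\\.|[^'\\\n])*'                           # character literals
''', re.VERBOSE)


def find_javadoc(source):
    """Return Javadoc comments that start outside other comments and literals."""
    return [m.group('javadoc')
            for m in TOKEN.finditer(source)
            if m.group('javadoc') is not None]


sample = '\n'.join([
    '// line comment /** not actually a javadoc comment */',
    'String s = "/** not a comment */";',
    '/* plain /** block */',
    '/** real one */',
    'int x; // line comment /** unfinished "Javadoc comment"',
    '/** second */',
])
print(find_javadoc(sample))
```

Because the line-comment alternative swallows everything up to the end of the line, both problem cases from the question (the false positive and the unfinished /** after //) are skipped without any lookbehind.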

Improving a Javadoc regular expression
