有两种评论风格,C风格和C ++风格,如何识别它们?
/* comments */
// comments
我可以随意使用任何方法和第3库。
答案 0 :(得分:5)
为了可靠地查找Java源文件中的所有注释,我不会使用正则表达式,而是使用真正的词法分析器(也就是标记化器)。
Java的两个流行选择是:
与流行的看法相反,ANTLR也可用于在没有解析器的情况下创建仅词法分析器。
这是一个快速的ANTLR演示。您需要在同一目录中包含以下文件:
lexer grammar JavaCommentLexer;
options {
filter=true;
}
SingleLineComment
: FSlash FSlash ~('\r' | '\n')*
;
MultiLineComment
: FSlash Star .* Star FSlash
;
StringLiteral
: DQuote
( (EscapedDQuote)=> EscapedDQuote
| (EscapedBSlash)=> EscapedBSlash
| Octal
| Unicode
| ~('\\' | '"' | '\r' | '\n')
)*
DQuote {skip();}
;
CharLiteral
: SQuote
( (EscapedSQuote)=> EscapedSQuote
| (EscapedBSlash)=> EscapedBSlash
| Octal
| Unicode
| ~('\\' | '\'' | '\r' | '\n')
)
SQuote {skip();}
;
fragment EscapedDQuote
: BSlash DQuote
;
fragment EscapedSQuote
: BSlash SQuote
;
fragment EscapedBSlash
: BSlash BSlash
;
fragment FSlash
: '/' | '\\' ('u002f' | 'u002F')
;
fragment Star
: '*' | '\\' ('u002a' | 'u002A')
;
fragment BSlash
: '\\' ('u005c' | 'u005C')?
;
fragment DQuote
: '"'
| '\\u0022'
;
fragment SQuote
: '\''
| '\\u0027'
;
fragment Unicode
: '\\u' Hex Hex Hex Hex
;
fragment Octal
: '\\' ('0'..'3' Oct Oct | Oct Oct | Oct)
;
fragment Hex
: '0'..'9' | 'a'..'f' | 'A'..'F'
;
fragment Oct
: '0'..'7'
;
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
JavaCommentLexer lexer = new JavaCommentLexer(new ANTLRFileStream("Test.java"));
CommonTokenStream tokens = new CommonTokenStream(lexer);
for(Object o : tokens.getTokens()) {
CommonToken t = (CommonToken)o;
if(t.getType() == JavaCommentLexer.SingleLineComment) {
System.out.println("SingleLineComment :: " + t.getText().replace("\n", "\\n"));
}
if(t.getType() == JavaCommentLexer.MultiLineComment) {
System.out.println("MultiLineComment :: " + t.getText().replace("\n", "\\n"));
}
}
}
}
\u002f\u002a <- multi line comment start
multi
line
comment // not a single line comment
\u002A/
public class Test {
// single line "not a string"
String s = "\u005C" \242 not // a comment \\\" \u002f \u005C\u005C \u0022;
/*
regular multi line comment
*/
char c = \u0027"'; // the " is not the start of a string
char q1 = '\u005c''; // == '\''
char q2 = '\u005c\u0027'; // == '\''
char q3 = \u0027\u005c\u0027\u0027; // == '\''
char c4 = '\047';
String t = "/*";
\u002f\u002f another single line comment
String u = "*/";
}
现在,要运行演示,请执行:
bart@hades:~/Programming/ANTLR/Demos/JavaComment$ java -cp antlr-3.2.jar org.antlr.Tool JavaCommentLexer.g
bart@hades:~/Programming/ANTLR/Demos/JavaComment$ javac -cp antlr-3.2.jar *.java
bart@hades:~/Programming/ANTLR/Demos/JavaComment$ java -cp .:antlr-3.2.jar Main
你会看到以下内容被打印到控制台:
MultiLineComment :: \u002f\u002a <- multi line comment start\nmulti\nline\ncomment // not a single line comment\n\u002A/
SingleLineComment :: // single line "not a string"
SingleLineComment :: // a comment \\\" \u002f \u005C\u005C \u0022;
MultiLineComment :: /*\n regular multi line comment\n */
SingleLineComment :: // the " is not the start of a string
SingleLineComment :: // == '\''
SingleLineComment :: // == '\''
SingleLineComment :: // == '\''
SingleLineComment :: \u002f\u002f another single line comment
当然,您可以自己创建一种带正则表达式的词法分析器。但是,以下演示不处理源文件中的Unicode文字:
/* <- multi line comment start
multi
line
comment // not a single line comment
*/
public class Test2 {
// single line "not a string"
String s = "\" \242 not // a comment \\\" ";
/*
regular multi line comment
*/
char c = '"'; // the " is not the start of a string
char q1 = '\''; // == '\''
char c4 = '\047';
String t = "/*";
// another single line comment
String u = "*/";
}
import java.util.*;
import java.io.*;
import java.util.regex.*;
public class Main2 {
private static String read(File file) throws IOException {
StringBuilder b = new StringBuilder();
Scanner scan = new Scanner(file);
while(scan.hasNextLine()) {
String line = scan.nextLine();
b.append(line).append('\n');
}
return b.toString();
}
public static void main(String[] args) throws Exception {
String contents = read(new File("Test2.java"));
String slComment = "//[^\r\n]*";
String mlComment = "/\\*[\\s\\S]*?\\*/";
String strLit = "\"(?:\\\\.|[^\\\\\"\r\n])*\"";
String chLit = "'(?:\\\\.|[^\\\\'\r\n])+'";
String any = "[\\s\\S]";
Pattern p = Pattern.compile(
String.format("(%s)|(%s)|%s|%s|%s", slComment, mlComment, strLit, chLit, any)
);
Matcher m = p.matcher(contents);
while(m.find()) {
String hit = m.group();
if(m.group(1) != null) {
System.out.println("SingleLine :: " + hit.replace("\n", "\\n"));
}
if(m.group(2) != null) {
System.out.println("MultiLine :: " + hit.replace("\n", "\\n"));
}
}
}
}
如果您运行Main2
,则会在控制台上打印以下内容:
MultiLine :: /* <- multi line comment start\nmulti\nline\ncomment // not a single line comment\n*/
SingleLine :: // single line "not a string"
MultiLine :: /*\n regular multi line comment\n */
SingleLine :: // the " is not the start of a string
SingleLine :: // == '\''
SingleLine :: // another single line comment
答案 1 :(得分:3)
编辑:我一直在搜索,但这里是真正的正在使用的正则表达式:
String regex = "((//[^\n\r]*)|(/\\*(.+?)\\*/))"; // New Regex
List<String> comments = new ArrayList<String>();
Pattern p = Pattern.compile(regex, Pattern.DOTALL);
Matcher m = p.matcher(code);
// code is the C-Style code, in which you want to serach
while (m.find())
{
System.out.println(m.group(1));
comments.add(m.group(1));
}
使用此输入:
import Blah;
//Comment one//
line();
/* Blah */
line2(); // something weird
/* Multiline
another line for the comment
*/
它会生成此输出:
//Comment one//
/* Blah */
line2(); // something weird
/* Multiline
another line for the comment
*/
请注意,输出的最后三行是一次打印。
答案 2 :(得分:0)
你试过正则表达式吗? Here是Java示例的一个很好的总结。 可能需要一些调整然而,仅使用正则表达式对于更复杂的结构(嵌套注释,字符串中的“注释”)是不够的,但这是一个不错的开始。