So I have some strings:
//Blah blah blach
// sdfkjlasdf
"Another //thing"
I'm using a Java regex to remove every line that contains a double slash, like so:
theString = Pattern.compile("//(.*?)\\n", Pattern.DOTALL).matcher(theString).replaceAll("");
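It works for the most part, but the problem is that it removes every occurrence, and I need a way to keep it from removing the occurrences inside quotes. How can I do that?
Answer 0 (score: 4)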
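Instead of using a parser that parses an entire Java source file, or writing something yourself that parses only the parts you're interested in, you could use a third-party tool like ANTLR.
ANTLR lets you define only the tokens you are interested in (and, of course, the tokens that can mess up your token stream, such as multi-line comments and String and char literals). So you only need to define a lexer (another word for tokenizer) that handles those tokens correctly.
This is called a grammar. In ANTLR, such a grammar could look like this: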
lexer grammar FuzzyJavaLexer;
options{filter=true;}
SingleLineComment
: '//' ~( '\r' | '\n' )*
;
MultiLineComment
: '/*' .* '*/'
;
StringLiteral
: '"' ( '\\' . | ~( '"' | '\\' ) )* '"'
;
CharLiteral
: '\'' ( '\\' . | ~( '\'' | '\\' ) )* '\''
;
Save the above in a file called FuzzyJavaLexer.g. Now download ANTLR 3.2 and save it in the same folder as your FuzzyJavaLexer.g file.
Execute the following command:
java -cp antlr-3.2.jar org.antlr.Tool FuzzyJavaLexer.g
This creates the FuzzyJavaLexer.java source file.
Of course you need to test the lexer, which you can do by creating a file called FuzzyJavaLexerTest.java and copying the code below into it:
import org.antlr.runtime.*;
public class FuzzyJavaLexerTest {
public static void main(String[] args) throws Exception {
String source =
"class Test { \n"+
" String s = \" ... \\\" // no comment \"; \n"+
" /* \n"+
" * also no comment: // foo \n"+
" */ \n"+
" char quote = '\"'; \n"+
" // yes, a comment, finally!!! \n"+
" int i = 0; // another comment \n"+
"} \n";
System.out.println("===== source =====");
System.out.println(source);
System.out.println("==================");
ANTLRStringStream in = new ANTLRStringStream(source);
FuzzyJavaLexer lexer = new FuzzyJavaLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
for(Object obj : tokens.getTokens()) {
Token token = (Token)obj;
if(token.getType() == FuzzyJavaLexer.SingleLineComment) {
System.out.println("Found a SingleLineComment on line "+token.getLine()+
", starting at column "+token.getCharPositionInLine()+
", text: "+token.getText());
}
}
}
}
Next, compile FuzzyJavaLexer.java and FuzzyJavaLexerTest.java with:
javac -cp .:antlr-3.2.jar *.java
Finally, run the FuzzyJavaLexerTest.class file:
// *nix/MacOS
java -cp .:antlr-3.2.jar FuzzyJavaLexerTest
or:
// Windows
java -cp .;antlr-3.2.jar FuzzyJavaLexerTest
After that you'll see the following printed to your console:
===== source =====
class Test {
String s = " ... \" // no comment ";
/*
* also no comment: // foo
*/
char quote = '"';
// yes, a comment, finally!!!
int i = 0; // another comment
}
==================
Found a SingleLineComment on line 7, starting at column 2, text: // yes, a comment, finally!!!
Found a SingleLineComment on line 8, starting at column 13, text: // another comment
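Easy enough, eh? :)
Answer 1 (score: 2)
Use a parser: walk the input char by char and keep track of whether you are inside a quoted string.
Kickoff example: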
StringBuilder builder = new StringBuilder();
boolean quoted = false;
for (String line : string.split("\\n")) {
for (int i = 0; i < line.length(); i++) {
char c = line.charAt(i);
if (c == '"') {
quoted = !quoted;
}
if (!quoted && c == '/' && i + 1 < line.length() && line.charAt(i + 1) == '/') {
break;
} else {
builder.append(c);
}
}
builder.append("\n");
}
String parsed = builder.toString();
System.out.println(parsed);
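The kickoff example toggles `quoted` on every `"`, so an escaped quote (`\"`) inside a string or a `'"'` char literal would flip the state at the wrong place. Below is a minimal sketch of one way to extend the same char-by-char idea. The class and method names (`CommentStripper`, `stripLineComments`) are hypothetical, and it assumes that only `//` comments need removing; `/* ... */` blocks are deliberately not recognised.

```java
final class CommentStripper {

    // Hypothetical helper, not part of the answer above.
    // Assumption: only // comments are removed; /* ... */ blocks are not recognised.
    static String stripLineComments(String source) {
        StringBuilder out = new StringBuilder();
        char quote = 0; // 0 = outside a literal, otherwise the opening quote character
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (quote != 0) {
                out.append(c);
                if (c == '\\' && i + 1 < source.length()) {
                    // Copy the escaped character so \" or \' cannot close the literal.
                    out.append(source.charAt(i + 1));
                    i += 2;
                    continue;
                }
                if (c == quote || c == '\n') {
                    quote = 0; // literal closed, or left unterminated at end of line
                }
                i++;
            } else if (c == '"' || c == '\'') {
                quote = c; // entering a string or char literal
                out.append(c);
                i++;
            } else if (c == '/' && i + 1 < source.length() && source.charAt(i + 1) == '/') {
                // Skip the comment up to, but not including, the line break.
                while (i < source.length() && source.charAt(i) != '\n') {
                    i++;
                }
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String input = "//Blah blah blach\n// sdfkjlasdf\n\"Another //thing\"\n";
        System.out.print(stripLineComments(input));
    }
}
```

Run on the three strings from the question, this should leave the two comment lines empty and keep "Another //thing" intact. Dropping the now-empty lines as well would be a small extra step.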
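Answer 2 (score: 1)
(This is in answer to the question @finnw raised in the comments under his answer. It's not so much an answer to the OP's question as an extended explanation of why a regex is the wrong tool.)
Here's my test code: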
String r0 = "(?m)^((?:[^\"]|\"(?:[^\"]|\\\")*\")*)//.*$";
String r1 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n]|\\\")*\")*)//.*$";
String r2 = "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n\\\\]|\\\\\")*\")*)//.*$";
String test =
"class Test { \n"+
" String s = \" ... \\\" // no comment \"; \n"+
" /* \n"+
" * also no comment: // but no harm \n"+
" */ \n"+
" /* no comment: // much harm */ \n"+
" char quote = '\"'; // comment \n"+
" // another comment \n"+
" int i = 0; // and another \n"+
"} \n"
.replaceAll(" +$", "");
System.out.printf("%n%s%n", test);
System.out.printf("%n%s%n", test.replaceAll(r0, "$1"));
System.out.printf("%n%s%n", test.replaceAll(r1, "$1"));
System.out.printf("%n%s%n", test.replaceAll(r2, "$1"));
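r0 is the edited regex from your answer; it only removes the final comment (// and another), because everything else is matched in group(1). Setting multiline mode ((?m)) is necessary for ^ and $ to work right, but it doesn't solve this problem because your character classes can still match newlines.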
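r1 deals with the newline problem, but it still incorrectly matches // no comment inside the string literal, for two reasons: you didn't include a backslash in the first part of (?:[^\"\r\n]|\\\"), and you only used two of them to match the backslash in the second part.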
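r2 fixes that, but it makes no attempt to deal with the quote in the char literal or with single-line comments inside the multiline comments. They can probably be handled too, but this regex is already Baby Godzilla; do you really want to see it all grown up?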
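To make the r2 limitation concrete, here is a small sketch that applies the same pattern to two single lines. The class name `R2Demo` and the sample inputs are assumptions added for illustration; the pattern itself is copied from r2 above.

```java
final class R2Demo {

    // Same pattern as r2 in the test code above.
    static final String R2 =
        "(?m)^((?:[^\"\r\n]|\"(?:[^\"\r\n\\\\]|\\\\\")*\")*)//.*$";

    // Keeps group 1 (everything before the comment) and drops the rest of the line.
    static String strip(String line) {
        return line.replaceAll(R2, "$1");
    }

    public static void main(String[] args) {
        // The // inside the string literal is skipped; the trailing comment is removed.
        System.out.println("[" + strip("String s = \"a // b\"; // c") + "]");
        // The lone " in the char literal is treated as an unterminated string,
        // so the regex cannot match and the comment stays in place.
        System.out.println("[" + strip("char q = '\"'; // c") + "]");
    }
}
```

The second call returns its input unchanged, which is the char-literal weakness described above: the pattern only knows about double-quoted strings.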
Answer 3 (score: 1)
The following is from a grep-like program I wrote (in Perl) a few years ago. It can strip Java comments before processing a file:
# ============================================================================
# ============================================================================
#
# strip_java_comments
# -------------------
#
# Strip the comments from a Java-like file. Multi-line comments are
# replaced with the equivalent number of blank lines so that all text
# left behind stays on the same line.
#
# Comments are replaced by at least one space.
#
# The text for an entire file is assumed to be in $_ and is returned
# in $_
#
# ============================================================================
# ============================================================================
sub strip_java_comments
{
s!( (?: \" [^\"\\]* (?: \\. [^\"\\]* )* \" )
| (?: \' [^\'\\]* (?: \\. [^\'\\]* )* \' )
| (?: \/\/ [^\n] *)
| (?: \/\* .*? \*\/)
)
!
my $x = $1;
my $first = substr($x, 0, 1);
if ($first eq '/')
{
"\n" x ($x =~ tr/\n//);
}
else
{
$x;
}
!esxg;
}
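This code does actually work properly and can't be fooled by tricky comment/quote combinations. It will probably be fooled by unicode escapes (\u0022 etc.), but you can easily deal with those first if you want to.
As it's Perl, not Java, the replacement code will have to change. I'll have a quick crack at producing equivalent Java. Stand by...
EDIT: I've just whipped this up. It will probably need work: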
// The trick is to search for both comments and quoted strings.
// That way we won't notice a (partial or full) comment within a quoted string
// or a (partial or full) quoted-string within a comment.
// (I may not have translated the back-slashes accurately. You'll figure it out)
Pattern p = Pattern.compile(
"( (?: \" [^\"\\\\]* (?: \\\\. [^\"\\\\]* )* \" )" + // " ... "
" | (?: ' [^'\\\\]* (?: \\\\. [^'\\\\]* )* ' )" + // or ' ... '
" | (?: // [^\\n] * )" + // or // ...
" | (?: /\\* .*? \\* / )" + // or /* ... */
")",
Pattern.DOTALL | Pattern.COMMENTS
);
Matcher m = p.matcher(entireInputFileAsAString);
StringBuffer output = new StringBuffer(); // appendReplacement only accepts a StringBuilder on Java 9+
while (m.find())
{
if (m.group(1).startsWith("/"))
{
// This is a comment. Replace it with a space...
m.appendReplacement(output, " ");
// ... or replace it with an equivalent number of newlines
// (exercise for reader)
}
else
{
// We matched a quoted string. Put it back
m.appendReplacement(output, "$1");
}
}
m.appendTail(output);
return output.toString();
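The snippet above leaves the newline-preserving replacement as an exercise. A self-contained sketch of how that could look follows. The class name `JavaCommentStripper`, the method name `strip`, and the choice to replace a single-line comment with one space are all assumptions rather than part of the original answer, and unicode escapes such as \u0022 are still not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JavaCommentStripper {

    // Same idea as the pattern above: match literals and comments in one pass,
    // so a comment marker inside a literal (or a quote inside a comment) is skipped.
    private static final Pattern TOKEN = Pattern.compile(
        "(   \" [^\"\\\\]* (?: \\\\. [^\"\\\\]* )* \" " +  // " ... "
        "  | '  [^'\\\\]*  (?: \\\\. [^'\\\\]*  )* '  " +  // or ' ... '
        "  | // [^\\n]*                               " +  // or // ...
        "  | /\\* .*? \\*/                            " +  // or /* ... */
        ")",
        Pattern.DOTALL | Pattern.COMMENTS
    );

    // Hypothetical helper: returns the source with comments removed.
    static String strip(String source) {
        Matcher m = TOKEN.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String token = m.group(1);
            String replacement;
            if (token.startsWith("/")) {
                // Comment: keep only its line breaks so later code stays on the same line.
                replacement = token.replaceAll("[^\\n]", "");
                if (replacement.isEmpty()) {
                    replacement = " "; // single-line comment becomes one space
                }
            } else {
                replacement = token; // string or char literal: put it back untouched
            }
            // quoteReplacement stops $ and \ in the literal from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "String s = \"a // b\"; // c\nint i = 1; /* x\n y */ int j;\n";
        System.out.print(strip(source));
    }
}
```

Using `Matcher.quoteReplacement` instead of the "$1" back-reference is a design choice here: it lets the same `appendReplacement` call serve both branches without any risk of a literal's contents being misread as replacement syntax.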
Answer 4 (score: 0)