Escaping special characters in a Java regular expression

Time: 2015-07-02 19:55:18

Tags: java regex

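I am trying to create a regular expression that accepts almost every character on a US keyboard, except for a few. This is what I have so far (not everything is included yet):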

^[a-zA-Z0-9!~`@#$%\\^]

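Now, I know that ^ is the first character I have come across that needs a backslash in front of it to escape it. When I put a single \ in front of it, I get a compile error (invalid escape sequence). When I run the expression against a String, it completely ignores the ^ rule. Does anyone know what I am doing wrong?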

2 answers:

Answer 0 (score: 5)

Since you are using a character class, you do not need to escape ^. Just use:

^[a-zA-Z0-9!~`@#$%^]

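A character class, written [ ... ], lets you list the characters you want, and most special characters lose their special meaning inside the square brackets. (The exceptions are ^ in first position, which negates the class, plus -, ], \ and, in Java, [ and &&.) The other case where you need a backslash is a shorthand class such as \d or \w. Because a backslash inside a Java string literal must itself be escaped, you write these as \\d and \\w (this is a requirement of Java, not of the regex engine).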

For example:

"a".matches("^[a-zA-Z0-9!~`@#$%^]");
"asdf".matches("^[a-zA-Z0-9!~`@#$%^]+"); // for multiple characters

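As a rough sketch of the point above (the class and method names here are made up for illustration and are not part of the original answer), the following compares the class with and without an escaped caret, and shows the one position where ^ does stay special inside brackets:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class CaretInCharacterClass {
    // Character class from the answer, with an unescaped caret in last position.
    static final Pattern PLAIN = Pattern.compile("^[a-zA-Z0-9!~`@#$%^]+");
    // Same class with the caret escaped; the backslash is doubled for the Java string literal.
    static final Pattern ESCAPED = Pattern.compile("^[a-zA-Z0-9!~`@#$%\\^]+");
    // A caret in FIRST position negates the class: any character except a, b or c.
    static final Pattern NEGATED = Pattern.compile("[^abc]+");
    // A shorthand class inside brackets needs the doubled backslash in Java source.
    static final Pattern DIGIT_OR_CARET = Pattern.compile("[\\d^]+");

    // True if the whole input matches the pattern.
    static boolean fullMatch(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(fullMatch(PLAIN, "a^b"));            // true
        System.out.println(fullMatch(ESCAPED, "a^b"));          // true
        System.out.println(fullMatch(PLAIN, "a-b"));            // false: '-' is not in the class
        System.out.println(fullMatch(NEGATED, "xyz"));          // true
        System.out.println(fullMatch(NEGATED, "a"));            // false
        System.out.println(fullMatch(DIGIT_OR_CARET, "12^3"));  // true
    }
}
```

The escaped and unescaped versions behave the same here because the caret is not the first character of the class.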
Answer 1 (score: 1)

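Outside a character class, you only need to escape ^ when you want to match it literally, that is, when you want to find text that contains the ^ character.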

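If you intend to use ^ with its special meaning (the start of a line or string), there is no need to escape it. Simply type

"^[a-zA-Z0-9!~`@#$%\\^]"

in your source code. The backslash towards the end of this regular expression does not matter. Because of the special meaning of the backslash in Java, you need to type two backslashes, but that has nothing to do with how the regular expression is processed. The regex engine receives a single backslash, which tells it to read the following character as a literal, but ^ is a literal at that position inside the brackets anyway.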
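To make the two layers of escaping visible, here is a small sketch (the class and helper names are hypothetical, chosen only for this illustration). It prints what the regex engine actually receives and checks that the escaped and unescaped caret accept the same input:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class BackslashLayers {
    // Returns the pattern text exactly as the regex engine receives it.
    static String engineView(String javaLiteral) {
        return Pattern.compile(javaLiteral).pattern();
    }

    // True if the whole input matches the given regex.
    static boolean matchesClass(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Two backslashes in the source become ONE backslash in the string.
        System.out.println("\\^".length());                 // 2 (one backslash plus the caret)
        System.out.println(engineView("[%\\^]"));           // [%\^]

        // Escaped or not, the caret is a literal member of the class here.
        System.out.println(matchesClass("[%\\^]", "^"));    // true
        System.out.println(matchesClass("[%^]", "^"));      // true

        // The backslash only escapes the caret; it is not itself a member of the class.
        System.out.println(matchesClass("[%\\^]", "\\"));   // false
    }
}
```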

To elaborate on your comment about [ and ]:

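Brackets have a special meaning in regular expressions: they form the boundaries of a list of characters given by the pattern (the characters listed form a so-called character class). Let's break down the regular expression from above to make things clear.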

^ Matches the start of the text
[ Opening boundary of your character class
a-z Lower case letters of A to Z
A-Z Upper case letters of A to Z
0-9 Numbers from 0 to 9
! Exclamation mark, literally
~ Tilde, literally
` Backtick, literally
@ The @ character, literally
# Hash, literally
$ Dollar, literally
% Percent sign, literally
\\ Backslash. The regular expression engine only receives a single backslash, as the other backslash is consumed by Java's syntax for string literals. It would mark the following character as a literal, but ^ is already a literal at this position in a character class, so the backslash has no effect.
^ Caret, literally
] Closing boundary of your character class

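The order of the items inside a character class definition does not matter (apart from position-sensitive characters such as a leading ^, which negates the class). The expression above matches if the first character of the examined text is part of the character class. Whether the other characters in the examined text matter depends on how you use the regular expression.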
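To illustrate how the way the expression is used changes the result, here is a minimal sketch (the class and method names are assumptions made for this example). Matcher.matches() requires the whole input to match, while Matcher.find() only needs the pattern to match somewhere, which the leading ^ pins to the start:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class FirstCharacterCheck {
    // The pattern from the question: one allowed character at the start of the text.
    static final Pattern P = Pattern.compile("^[a-zA-Z0-9!~`@#$%\\^]");

    // matches(): the ENTIRE input must match, so only one-character inputs can pass.
    static boolean wholeInput(String s) {
        return P.matcher(s).matches();
    }

    // find(): only the first character is checked; whatever follows is ignored.
    static boolean firstCharAllowed(String s) {
        return P.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeInput("a"));          // true
        System.out.println(wholeInput("asdf"));       // false: more than one character
        System.out.println(firstCharAllowed("asdf")); // true
        System.out.println(firstCharAllowed("aä"));   // true: only the first character counts
        System.out.println(firstCharAllowed("äa"));   // false: first character is not in the class
    }
}
```

String.matches(regex) behaves like wholeInput here, which is why a + quantifier is needed to accept longer strings with that method.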

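When you are starting out with regular expressions, you should always match against several test texts and verify the behaviour. It is advisable to turn these test cases into unit tests, so that you can be highly confident your program behaves correctly.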

A simple code example for testing an expression:

public class Test {
    public static void main(String[] args) {
        String regexp = "^[ a-zA-Z0-9!~`@#$%\\\\^\\[\\]]+$";
        String[] testdata = new String[] {
                "abc",
                "2332",
                "some@test",
                "test [ and ] test end",
                // Following sample will not match the pattern.
                "äöüßµøł"
        };
        for (String toExamine : testdata) {
            if (toExamine.matches(regexp)) {
                System.out.println("Match: " + toExamine);
            } else {
                System.out.println("No match: " + toExamine);
            }
        }
    }
}

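Note that I am using a modified pattern here. It ensures that all characters in the examined string match your character class. I also extended the character class to allow \ as well as spaces, [ and ]. The breakdown is: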

^ Matches the start of the text
[ Opening boundary of your character class
a-z Lower case letters of A to Z
A-Z Upper case letters of A to Z
0-9 Numbers from 0 to 9
! Exclamation mark, literally
~ Tilde, literally
` Backtick, literally
@ The @ character, literally
# Hash, literally
$ Dollar, literally
% Percent sign, literally
\\\\ Backslash, literally. The regular expression engine only receives 2 backslashes, as every other backslash is consumed by Java's syntax for string literals. The first backslash marks the second backslash as occurring literally in the string.
^ Caret, literally
\\[ Opening bracket, literally. The backslash makes the bracket lose its meaning as opening a character class definition.
\\] Closing bracket, literally. The backslash makes the bracket lose its meaning as closing a character class definition.
] Closing boundary of your character class
+ Means any number of characters matching your character class definition can occur, but at least 1 such character needs to be present for a match
$ Matches the end of the text
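
As a quick check of the breakdown above, the sketch below (with a hypothetical class name and helper) prints the extended pattern as the regex engine receives it and tries a few inputs containing a backslash, brackets and a space:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class ExtendedPatternCheck {
    // Same literal as in the test program: four backslashes in the source become
    // two in the string, which the regex engine reads as one literal backslash.
    static final String REGEXP = "^[ a-zA-Z0-9!~`@#$%\\\\^\\[\\]]+$";
    static final Pattern P = Pattern.compile(REGEXP);

    // True if every character of the input belongs to the character class.
    static boolean allowed(String s) {
        return P.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(REGEXP);            // ^[ a-zA-Z0-9!~`@#$%\\^\[\]]+$
        System.out.println(allowed("a\\b"));   // true: the input is a\b
        System.out.println(allowed("[x] y"));  // true: brackets and space are allowed
        System.out.println(allowed("a^b"));    // true
        System.out.println(allowed(""));       // false: + needs at least one character
        System.out.println(allowed("a-b"));    // false: '-' is not in the class
    }
}
```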

One thing I do not understand is why anyone would use the characters of a US keyboard as the criterion for validation.