Question

I have a regex for validating UTF-8 characters.

String regex = "[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]*";

I also want to do a range check, so I modified it to

String regex = "[[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]*]";
String rangeRegex = regex + "{0,30}";

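Note that this is the same regex, only wrapped in [ ].

Now I can validate the range using rangeRegex, but regex no longer validates UTF-8 characters.

My question is: how does [] affect the regex? If I remove the [] from the original regex, it validates UTF-8 characters but not the range. If I add the [], it validates with the range but not without it!

Sample test code: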

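import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.lang3.StringUtils; // assumption: StringUtils comes from Apache Commons Lang 3

public class Test {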

    static String regex =  "[[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]*]" ;
    public static void main(String[] args) {
        String userId = null;
        //testUserId(userId);
        userId = "";
        testUserId(userId);
        userId = "æÆbBcCćĆčČçďĎǳǲdzsDzs";
        testUserId(userId);
        userId = "test123";
        testUserId(userId);
        userId = "abcxyzsd";
        testUserId(userId);

        String zip = "i«♣│axy";
        testZip(zip);
        zip = "331fsdfsdfasdfasd02c3";
        testZip(zip);
        zip = "331";
        testZip(zip);

    }

    /**
     * without range check
     * @param userId
     */
    static void testUserId(String userId){
        boolean pass = true;
        if ( !stringValidator(userId, regex)) {
            pass = false;
        }
        System.out.println(pass);
    }

    /**
     * with a range check
     * @param zip
     */
    static void testZip(String zip){
        boolean pass = true;
        String regex1 = regex + "{0,10}";
        if (StringUtils.isNotBlank(zip) && !stringValidator(zip, regex1)) {
            pass = false;
        }
        System.out.println(pass);
    }

    static boolean stringValidator(String str, String regex) {
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(str);
        return matcher.matches();
    }
}

Answer 1

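The explanation given is incorrect for Java regex.

In Java, unescaped paired square brackets inside a character class are not treated as literal [ and ] characters. They have a special meaning in Java character classes: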

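[a-d[m-p]]      a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]    d, e, or f (intersection)
[a-z&&[^bc]]    a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]   a through z, and not m through p: [a-lq-z] (subtraction)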
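The three operations listed above can be checked directly. The following is a minimal sketch; the class name CharClassOps and the helper accepts are made up for illustration and are not part of any library.

```java
import java.util.regex.Pattern;

public class CharClassOps {
    // True when the whole input is matched by the given character class,
    // i.e. the input is exactly one character that the class accepts.
    static boolean accepts(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        // Union: a-d or m-p
        System.out.println(accepts("[a-d[m-p]]", "n"));   // true
        System.out.println(accepts("[a-d[m-p]]", "e"));   // false
        // Intersection: only d, e, or f
        System.out.println(accepts("[a-z&&[def]]", "e")); // true
        System.out.println(accepts("[a-z&&[def]]", "a")); // false
        // Subtraction: a-z except b and c
        System.out.println(accepts("[a-z&&[^bc]]", "b")); // false
        System.out.println(accepts("[a-z&&[^bc]]", "d")); // true
    }
}
```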

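So, when you wrap the regex in [...], you get the union of the previous character class with a literal * character, which means: match one character from [\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}] or a literal *.

Moreover, [[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]*] is equal to [\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}*], since inside a character class the * symbol is no longer a special character (a quantifier) and becomes a literal asterisk.

If you use [[]], the engine throws an exception: Unclosed character class near index 3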
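To see the effect on whole-string matching, here is a small sketch comparing the wrapped pattern with its flattened form. The class name NestedClassDemo and its helpers are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

public class NestedClassDemo {
    // The wrapped pattern from the question and the equivalent flat class.
    static final String WRAPPED = "[[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]*]";
    static final String FLAT = "[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}*]";

    static boolean wrapped(String s) { return Pattern.matches(WRAPPED, s); }
    static boolean flat(String s) { return Pattern.matches(FLAT, s); }

    public static void main(String[] args) {
        // Both patterns describe exactly one character, so only
        // single-character inputs can match in full.
        String[] inputs = {"", "a", "*", "ab", "test123"};
        for (String s : inputs) {
            System.out.printf("\"%s\" wrapped=%b flat=%b%n", s, wrapped(s), flat(s));
        }
    }
}
```

Only "a" and "*" match here; the empty string and the longer inputs fail under both patterns, which is why the userId tests in the question print false.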

See this IDEONE demo:

System.out.println("abc[]".replaceAll("[[abc]]", "")); // => []
System.out.println("abc[]".replaceAll("[[]]", "")); // => error

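Whenever you need to check the length of a string with a regex, you need anchors and a limiting quantifier. When the regex is used with the Matcher#matches method, the anchoring is applied implicitly:

The matches method attempts to match the entire input sequence against the pattern.

Sample code: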
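The difference between whole-input matching and unanchored searching can be sketched as follows. The pattern, the class name MatchesVsFind, and the helper names are illustrative assumptions, with a limit of 3 chosen to keep the inputs short.

```java
import java.util.regex.Pattern;

public class MatchesVsFind {
    // At most three letters or digits.
    static final Pattern UP_TO_THREE = Pattern.compile("[\\p{L}\\p{N}]{0,3}");

    // matches() requires the pattern to cover the entire input.
    static boolean wholeInput(String s) {
        return UP_TO_THREE.matcher(s).matches();
    }

    // find() succeeds as soon as any substring matches.
    static boolean anywhere(String s) {
        return UP_TO_THREE.matcher(s).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeInput("abc"));  // true
        System.out.println(wholeInput("abcd")); // false: four characters exceed the limit
        System.out.println(anywhere("abcd"));   // true: "abc" matches at index 0
    }
}
```

With find, the quantifier alone does not limit the length of the input, so a length check needs either matches or explicit ^ and $ anchors.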

String regex = "[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]";
String new_regex = regex + "{0,30}"; 
System.out.println("Some string".matches(new_regex)); // => true

See this IDEONE demo

Update

Here is your code with comments:

String userId = "";
testUserId(userId); // false - Correct as we test an empty string with an at-least-one-char regex
userId = "æÆbBcCćĆčČçďĎǳǲdzsDzs";
testUserId(userId); // false - Correct as we only match 1 character string, others fail
userId = "test123";
testUserId(userId); // false - see above
userId = "abcxyzsd";
testUserId(userId); // false - see above

String zip = "i«♣│axy";
testZip(zip); // true - OK, 7-symbol string matches against [...]{0,10} regex
zip = "331fsdfsdfasdfasd02c3";
testZip(zip); // false - OK, 21-symbol string does not match a regex that requires only 0 to 10 characters
zip = "331";
testZip(zip); // true - OK, 3-symbol string matches against [...]{0,10} regex

Answer 2

* means 0 or more, so it is almost like {0,}. That is, you can replace the * with {0,30} and it should do what you want:

[\p{L}\p{M}\p{N}\p{P}\p{Z}\p{S}\p{C}]{0,30}
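A quick boundary check of this pattern, as a sketch: the class name LengthBoundary and the helpers valid and repeat are hypothetical and exist only for this example.

```java
import java.util.regex.Pattern;

public class LengthBoundary {
    static final Pattern UP_TO_30 =
        Pattern.compile("[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]{0,30}");

    // Whole-input match, so the quantifier acts as a length limit.
    static boolean valid(String s) {
        return UP_TO_30.matcher(s).matches();
    }

    // Builds a string of n copies of c.
    static String repeat(char c, int n) {
        StringBuilder sb = new StringBuilder(n);
        for (int i = 0; i < n; i++) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(valid(""));              // true: zero characters is allowed
        System.out.println(valid(repeat('x', 30))); // true: exactly at the limit
        System.out.println(valid(repeat('x', 31))); // false: one over the limit
    }
}
```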

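[] creates a character class, so [[]] would be "the character class of [, followed by a ]", since the first ] closes the character class prematurely, and that does not really do what you want.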

Correct me if I'm wrong, but the list of characters you have built is pretty much everything, so you could use .{0,30} to achieve the same effect.
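One caveat to "pretty much everything": in Java, . does not match line terminators unless Pattern.DOTALL is set, whereas \p{C} covers control characters such as \n. A minimal sketch of the difference, with the class name DotVsCategories and its helpers being made-up names for this example:

```java
import java.util.regex.Pattern;

public class DotVsCategories {
    static final String CATEGORIES = "[\\p{L}\\p{M}\\p{N}\\p{P}\\p{Z}\\p{S}\\p{C}]{0,30}";
    static final String DOT = ".{0,30}";

    static boolean byCategories(String s) {
        return Pattern.matches(CATEGORIES, s);
    }

    // Default mode: the dot excludes line terminators.
    static boolean byDot(String s) {
        return Pattern.matches(DOT, s);
    }

    // DOTALL mode: the dot matches any character, including line terminators.
    static boolean byDotAll(String s) {
        return Pattern.compile(DOT, Pattern.DOTALL).matcher(s).matches();
    }

    public static void main(String[] args) {
        String twoLines = "line1\nline2";
        System.out.println(byCategories(twoLines)); // true: \n is a control character (\p{C})
        System.out.println(byDot(twoLines));        // false: . stops at the line terminator
        System.out.println(byDotAll(twoLines));     // true
    }
}
```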

How does [] make a difference in a Java regex?
