Question

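I am having trouble constructing a regular expression that allows all UTF-8 characters except two: '_' and '%'.

So the whitelist is: ^[\u0000-\uFFFF] and the blacklist is: ^[^_%]

I need to combine these into a single expression.

I tried the following code, but it does not behave the way I want: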

    String input = "this";
    Pattern p = Pattern
            .compile("^[\u0000-\uFFFF]+$ | ^[^_%]");
    Matcher m = p.matcher(input);
    boolean result = m.matches();
    System.out.println(result);

Input: this
Actual output: false
Expected output: true

Answer 1

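In Java regex you can use character class intersections/subtractions to restrict a "generic" character class.

The character class [a-z&&[^aeiuo]] matches a single letter that is not a vowel. In other words, it matches a consonant.

Use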
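As a small illustration of the consonant class described above, the following sketch collects every character of a string that the class accepts. The class name `ConsonantDemo` and the helper `consonantsOf` are hypothetical names chosen for this example only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ConsonantDemo {
    // Lowercase letters a-z, minus the vowels (character class subtraction).
    static final Pattern CONSONANT = Pattern.compile("[a-z&&[^aeiuo]]");

    // Hypothetical helper: concatenates every single-character match found in the input.
    static String consonantsOf(String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = CONSONANT.matcher(input);
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(consonantsOf("regex"));       // rgx
        System.out.println(consonantsOf("hello world")); // hllwrld
    }
}
```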

"^[\u0000-\uFFFF&&[^_%]]+$"

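to match every character in the range \u0000-\uFFFF except _ and %.

For more information on the character class intersection/subtraction available in Java regular expressions, see The Java™ Tutorials: Character Classes.

A test with the OCPSoft Visual Regex Tester shows that there is no match once % is added to the string.

Java demo: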

    String input = "this";
    Pattern p = Pattern.compile("[\u0000-\uFFFF&&[^_%]]+"); // No anchors because `matches()` is used
    Matcher m = p.matcher(input);
    boolean result = m.matches();
    System.out.println(result); // => true
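To see the rejected cases as well, here is a minimal sketch that runs the same intersection class against a few inputs. The names `IntersectionDemo` and `isAllowed` are made up for this example, and the pattern uses `\\u` escapes so that the regex engine, rather than the Java compiler, interprets them; that is a stylistic choice, not something the answer requires.

```java
import java.util.regex.Pattern;

final class IntersectionDemo {
    // Every character in \u0000-\uFFFF except '_' and '%'.
    static final Pattern ALLOWED = Pattern.compile("[\\u0000-\\uFFFF&&[^_%]]+");

    // Hypothetical helper: matches() must cover the whole input, so no anchors are needed.
    static boolean isAllowed(String input) {
        return ALLOWED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("this"));  // true
        System.out.println(isAllowed("th_is")); // false, contains '_'
        System.out.println(isAllowed("100%"));  // false, contains '%'
        System.out.println(isAllowed(""));      // false, '+' needs at least one character
    }
}
```

If the empty string should also be accepted, replacing `+` with `*` would do it.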

Answer 2

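Here is sample code that excludes some characters from a range using Lookahead and Lookbehind Zero-Length Assertions, which do not actually consume characters in the string but only assert whether a match is possible.

Sample code (excluding m and n from the range a-z):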

    String str = "abcdmnxyz";
    Pattern p=Pattern.compile("(?![mn])[a-z]");
    Matcher m=p.matcher(str);
    while(m.find()){
        System.out.println(m.group());
    }

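Output (each match is printed on its own line; shown here on one line):

a b c d x y z

You can handle your case in the same way.

Explanation of the regex (?![mn])[a-z]: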

  (?!                      look ahead to see if there is not:   
    [mn]                     any character of: 'm', 'n' 
  )                        end of look-ahead    
  [a-z]                    any character of: 'a' to 'z'

You can also split the whole range into sub-ranges and solve the problem above with the regex ([a-l]|[o-z]) or [a-lo-z].
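Applied to the question, the lookahead idea might look like the sketch below: each character has to pass the negative lookahead `(?![_%])` before the range consumes it. The class name `LookaheadDemo` and its helper methods are hypothetical, and the second half only checks that the lookahead and sub-range forms of the a-z example agree on two sample inputs.

```java
import java.util.regex.Pattern;

final class LookaheadDemo {
    // One or more characters from \u0000-\uFFFF, none of which is '_' or '%'.
    static final Pattern ALLOWED = Pattern.compile("(?:(?![_%])[\\u0000-\\uFFFF])+");

    // The a-z example written two ways: with a lookahead and with sub-ranges.
    static final Pattern WITH_LOOKAHEAD = Pattern.compile("(?:(?![mn])[a-z])+");
    static final Pattern WITH_SUBRANGES = Pattern.compile("[a-lo-z]+");

    static boolean isAllowed(String input) {
        return ALLOWED.matcher(input).matches();
    }

    // True when both spellings give the same whole-string result for the input.
    static boolean sameResult(String input) {
        return WITH_LOOKAHEAD.matcher(input).matches()
                == WITH_SUBRANGES.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("this"));       // true
        System.out.println(isAllowed("th_is"));      // false
        System.out.println(sameResult("abcdxyz"));   // true
        System.out.println(sameResult("abcdmnxyz")); // true
    }
}
```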

Answer 3

Your problem is the spaces on either side of the pipe.

" ^.*"
".*$ "

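will never match anything, because there can never be a space before the start of the input or after its end.

This has a chance of working: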

^[\u0000-\uFFFF]+$|^[^_%]
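The effect of the spaces can be checked with a short sketch like the following. The class `PipeSpacesDemo` and the helper `matchesWhole` are hypothetical names used only here. Note that the last call suggests the alternation on its own still accepts an underscore, because the first alternative matches the whole string by itself, so the blacklist is probably not enforced by this pattern.

```java
import java.util.regex.Pattern;

final class PipeSpacesDemo {
    // Hypothetical helper: compiles the regex and tests it against the whole input.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // The spaces around '|' are literal characters that the input must contain.
        String withSpaces = "^[\\u0000-\\uFFFF]+$ | ^[^_%]";
        String withoutSpaces = "^[\\u0000-\\uFFFF]+$|^[^_%]";

        System.out.println(matchesWhole(withSpaces, "this"));    // false
        System.out.println(matchesWhole(withoutSpaces, "this")); // true
        System.out.println(matchesWhole(withoutSpaces, "a_b"));  // true, '_' is not rejected
    }
}
```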

Combining a whitelist and a blacklist in a Java regex
