Bug in double negation of regex character classes?

Question

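Update: in Java 11, the bug described below appears to be fixed.

(It may have been fixed even earlier, but I don't know in exactly which version. The bug report about a similar problem, linked in nhahtdh's answer, suggests Java 9.)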

TL;DR (before the fix):
Why do [^\\D2], [^[^0-9]2] and [^2[^0-9]] give different results in Java?

Code used for the tests. You can skip it for now.

String[] regexes = { "[[^0-9]2]", "[\\D2]", "[013-9]", "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
String[] tests = { "x", "1", "2", "3", "^", "[", "]" };

System.out.printf("match | %9s , %6s | %6s , %6s , %6s , %10s%n", (Object[]) regexes);
System.out.println("-----------------------------------------------------------------------");
for (String test : tests)
    System.out.printf("%5s | %9b , %6b | %7b , %6b , %10b , %10b %n", test,
            test.matches(regexes[0]), test.matches(regexes[1]),
            test.matches(regexes[2]), test.matches(regexes[3]),
            test.matches(regexes[4]), test.matches(regexes[5]));

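Let's say I need a regex that accepts characters that are

not digits,
with the exception of 2.

So such a regex should match every character except 0, 1, 3, 4, ..., 9. I can write it in at least two ways, each being the union of all non-digits with 2: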

[[^0-9]2]
[\\D2]

Both of these regexes work as expected:

match , [[^0-9]2] ,  [\D2]
--------------------------
    x ,      true ,   true
    1 ,     false ,  false
    2 ,      true ,   true
    3 ,     false ,  false
    ^ ,      true ,   true
    [ ,      true ,   true
    ] ,      true ,   true

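Now let's say I want to invert the set of accepted characters (so I want to accept all digits except 2). I could create a regex that explicitly lists all accepted characters, like

[013-9]

or try to negate the two previously described regexes by wrapping them in another [^...], like

[^\\D2]
[^[^0-9]2]
or even
[^2[^0-9]]

but to my surprise only the first two versions ([013-9] and [^\D2]) work as expected: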

match | [[^0-9]2] ,  [\D2] | [013-9] , [^\D2] , [^[^0-9]2] , [^2[^0-9]] 
------+--------------------+------------------------------------------- 
    x |      true ,   true |   false ,  false ,       true ,       true 
    1 |     false ,  false |    true ,   true ,      false ,       true 
    2 |      true ,   true |   false ,  false ,      false ,      false 
    3 |     false ,  false |    true ,   true ,      false ,       true 
    ^ |      true ,   true |   false ,  false ,       true ,       true 
    [ |      true ,   true |   false ,  false ,       true ,       true 
    ] |      true ,   true |   false ,  false ,       true ,       true

So my question is: why don't [^[^0-9]2] and [^2[^0-9]] behave like [^\D2]? Can I somehow correct these regexes so that I can use [^0-9] inside them?
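For readers who want to repeat the comparison on their own JVM, here is a minimal self-contained sketch. The class name AcceptedChars and the helper accepted are made up for illustration. Instead of a true/false table it prints, for each regex, which of the candidate characters it accepts. No expected output is given for the last two regexes, because their results depend on the Java version, as described above.

```java
import java.util.regex.Pattern;

class AcceptedChars {
    /** Returns the characters of candidates that, taken alone, fully match regex. */
    static String accepted(String regex, String candidates) {
        Pattern p = Pattern.compile(regex);
        StringBuilder sb = new StringBuilder();
        for (char c : candidates.toCharArray()) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] regexes = { "[[^0-9]2]", "[\\D2]", "[013-9]", "[^\\D2]", "[^[^0-9]2]", "[^2[^0-9]]" };
        String candidates = "x123^[]";
        for (String regex : regexes) {
            // The last two lines of output differ between affected and fixed Java versions.
            System.out.printf("%-12s accepts: %s%n", regex, accepted(regex, candidates));
        }
    }
}
```

If the last two lines print the same characters as [013-9], the JVM in use no longer shows the behaviour from the question.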

Answer 1

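According to the JavaDoc page, nesting classes produces the union of the two classes, which makes it impossible to create an intersection using that notation:

To create a union, simply nest one class inside the other, such as [0-4[6-8]]. This particular union creates a single character class that matches the numbers 0, 1, 2, 3, 4, 6, 7, and 8.

To create an intersection you have to use &&:

To create a single character class matching only the characters common to all of its nested classes, use &&, as in [0-9&&[345]]. This particular intersection creates a single character class matching only the numbers common to both character classes: 3, 4, and 5.

The last part of your problem is still a mystery to me too. The union of [^2] and [^0-9] should indeed be [^2], so [^2[^0-9]] behaves as expected. [^[^0-9]2] behaving like [^0-9] is indeed strange, though.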
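As an illustration of the && operator quoted above, here is a small sketch of one way to express "all digits except 2" without negating a nested class: intersect the digits with [^2]. This is not something the answer itself proposes; the class name IntersectionDemo and the helper isDigitExceptTwo are hypothetical.

```java
import java.util.regex.Pattern;

class IntersectionDemo {
    // Digits intersected with "anything but 2": the subtraction idiom from the Pattern JavaDoc.
    static final Pattern DIGIT_EXCEPT_TWO = Pattern.compile("[0-9&&[^2]]");

    /** Returns true if s is exactly one digit other than 2. */
    static boolean isDigitExceptTwo(String s) {
        return DIGIT_EXCEPT_TWO.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "x", "1", "2", "3", "^", "[", "]" }) {
            System.out.printf("%s -> %b%n", s, isDigitExceptTwo(s));
        }
    }
}
```

The intersection form avoids wrapping a negated class in another negation, which is the construct the question stumbles over.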

Answer 2

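There is some strange voodoo going on in the character class parsing code of Oracle's implementation of the Pattern class, which comes with your JRE/JDK if you downloaded it from Oracle's website or if you are using OpenJDK. I have not checked how other JVM implementations (notably GNU Classpath) parse the regexes in the question.

From this point on, any reference to the Pattern class and its inner workings is strictly restricted to Oracle's implementation (the reference implementation).

It would take some time to read and understand how the Pattern class parses nested negation as shown in the question. However, I have written a program¹ to extract information from a Pattern object (with the Reflection API) to look at the result of compilation. The output below is from running my program on Java HotSpot Client VM version 1.7.0_51.

¹ Currently, the program is an embarrassing mess. I will update this post with a link once I have finished and refactored it.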
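The author's program is not shown, so the following is only a rough sketch of the general idea of peeking into a compiled Pattern with reflection. It assumes that the implementation keeps its first compiled node in a non-public field named root; that name is an implementation detail, not a documented API. On recent JDKs, reflective access to java.util.regex internals is typically blocked unless the JVM is started with a matching --add-opens option, so the sketch falls back to a message instead of failing. The class name PatternPeek and the method rootNodeType are hypothetical.

```java
import java.lang.reflect.Field;
import java.util.regex.Pattern;

class PatternPeek {
    /** Returns the class name of the first compiled node, or a note if it cannot be read. */
    static String rootNodeType(String regex) {
        Pattern pattern = Pattern.compile(regex);
        try {
            // Assumption: the implementation stores the start of the node chain in "root".
            Field root = Pattern.class.getDeclaredField("root");
            root.setAccessible(true); // may be refused by the module system on newer JDKs
            Object node = root.get(pattern);
            return node == null ? "null" : node.getClass().getName();
        } catch (ReflectiveOperationException | RuntimeException e) {
            return "unavailable: " + e;
        }
    }

    public static void main(String[] args) {
        System.out.println(rootNodeType("[^[^0-9]2]"));
    }
}
```

A full dump like the ones below would have to walk from that node through the rest of the chain, which this sketch does not attempt.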

[^0-9]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

Nothing surprising here.

[^[^0-9]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

[^[^[^0-9]]]
Start. Start unanchored match (minLength=1)
CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
  Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

The 2 cases above compile to the same program as [^0-9], which is counter-intuitive.

[[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

[\D2]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Ctype. Match POSIX character class DIGIT (US-ASCII)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

Nothing strange in the 2 cases above, as stated in the question.

[013-9]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 2 character(s):
    [U+0030][U+0031]
    01
  Pattern.rangeFor (character range). Match any character within the range from code point U+0033 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

[^\D2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
      Ctype. Match POSIX character class DIGIT (US-ASCII)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

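These 2 cases work as expected, as stated in the question. However, note how the engine takes the complement of the first character class (\D) and then applies set difference to remove the character class made up of the leftover.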

[^[^0-9]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

[^[^[^0-9]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

[^[^[^[^0-9]]]2]
Start. Start unanchored match (minLength=1)
Pattern.setDifference (character class subtraction). Match any character matched by the 1st character class, but NOT the 2nd character class:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
  BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
    [U+0032]
    2
LastNode
Node. Accept match

As confirmed by Keppil's testing in the comments, the output above shows that all 3 regexes above are compiled to the same program!

[^2[^0-9]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
      [U+0032]
      2
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

Instead of NOT(UNION(2, NOT(0-9))), which is [013-9], we get UNION(NOT(2), NOT(0-9)), which is equivalent to NOT(2).

[^2[^[^0-9]]]
Start. Start unanchored match (minLength=1)
Pattern.union (character class union). Match any character matched by either character classes below:
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    BitClass. Optimized character class with boolean[] to match characters in Latin-1 (code point <= 255). Match the following 1 character(s):
      [U+0032]
      2
  CharProperty.complement (character class negation). Match any character NOT matched by the following character class:
    Pattern.rangeFor (character range). Match any character within the range from code point U+0030 to code point U+0039 (both ends inclusive)
LastNode
Node. Accept match

The regex [^2[^[^0-9]]] compiles to the same program as [^2[^0-9]] due to the same bug.
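The difference between what the two regexes were meant to be and what the dumps above show can be modelled with plain predicates. The sketch below uses java.util.function.IntPredicate as a stand-in for the engine's character classes; the names NegationModel, INTENDED, COMPILED_A and COMPILED_B are made up for illustration and are not part of Pattern.

```java
import java.util.function.IntPredicate;

class NegationModel {
    static final IntPredicate DIGIT = c -> '0' <= c && c <= '9';
    static final IntPredicate TWO = c -> c == '2';

    // Intended meaning of both regexes: NOT(UNION(NOT(0-9), 2)), i.e. [013-9].
    static final IntPredicate INTENDED = DIGIT.negate().or(TWO).negate();
    // Program dumped for [^[^0-9]2]: setDifference(complement(0-9), 2).
    static final IntPredicate COMPILED_A = DIGIT.negate().and(TWO.negate());
    // Program dumped for [^2[^0-9]]: union(complement(2), complement(0-9)).
    static final IntPredicate COMPILED_B = TWO.negate().or(DIGIT.negate());

    /** Returns the characters of candidates accepted by p. */
    static String accepted(IntPredicate p, String candidates) {
        StringBuilder sb = new StringBuilder();
        for (char c : candidates.toCharArray()) {
            if (p.test(c)) sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String candidates = "x123^[]";
        System.out.println(accepted(INTENDED, candidates));   // 13
        System.out.println(accepted(COMPILED_A, candidates)); // x^[]
        System.out.println(accepted(COMPILED_B, candidates)); // x13^[]
    }
}
```

The second and third lines reproduce the [^[^0-9]2] and [^2[^0-9]] columns of the table in the question.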

There is an open bug that seems to be of the same nature: JDK-6609854.

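Explanation

Preliminary

Below are implementation details of the Pattern class that one should know before reading further:

The Pattern class compiles a String into a chain of nodes. Each node is in charge of a small, well-defined responsibility and delegates the rest of the work to the next node in the chain. The Node class is the base class of all nodes.
The CharProperty class is the base class of all character-class related Nodes.
The BitClass class is a subclass of CharProperty that uses a boolean[] array to speed up matching of Latin-1 characters (code point <= 255). It has an add method, which allows characters to be added during compilation.
CharProperty.complement, Pattern.union and Pattern.intersection are methods corresponding to set operations. What they do is self-explanatory.
Pattern.setDifference is the asymmetric set difference.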
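To make the BitClass description more concrete, here is a deliberately simplified model of the idea: a boolean[] lookup table for code points up to 255, filled through an add method. It is a sketch under that description only, not the JDK's code; the class name SimpleBitClass and the method matches are hypothetical.

```java
class SimpleBitClass {
    // One flag per Latin-1 code point (0..255).
    private final boolean[] bits = new boolean[256];

    /** Marks a Latin-1 code point as a member of the class; returns this for chaining. */
    SimpleBitClass add(int codePoint) {
        if (codePoint < 0 || codePoint > 255) {
            throw new IllegalArgumentException("not a Latin-1 code point: " + codePoint);
        }
        bits[codePoint] = true;
        return this;
    }

    /** Membership test: a single array lookup, false for anything outside Latin-1. */
    boolean matches(int codePoint) {
        return codePoint >= 0 && codePoint < 256 && bits[codePoint];
    }

    public static void main(String[] args) {
        SimpleBitClass zeroOrOne = new SimpleBitClass().add('0').add('1');
        System.out.println(zeroOrOne.matches('1')); // true
        System.out.println(zeroOrOne.matches('2')); // false
    }
}
```

This mirrors the "[U+0030][U+0031]" entry in the dump of [013-9], where the two literal characters end up in one table while the range 3-9 is kept as a separate node.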

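Parsing a character class at first glance

Before looking at the full code of the CharProperty clazz(boolean consume) method, which is the method responsible for parsing a character class, let us look at an extremely simplified version of the code to understand its flow: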

private CharProperty clazz(boolean consume) {
    // [Declaration and initialization of local variables - OMITTED]
    BitClass bits = new BitClass();
    int ch = next();
    for (;;) {
        switch (ch) {
            case '^':
                // Negates if first char in a class, otherwise literal
                if (firstInClass) {
                    // [CODE OMITTED]
                    ch = next();
                    continue;
                } else {
                    // ^ not first in class, treat as literal
                    break;
                }
            case '[':
                // [CODE OMITTED]
                ch = peek();
                continue;
            case '&':
                // [CODE OMITTED]
                continue;
            case 0:
                // [CODE OMITTED]
                // Unclosed character class is checked here
                break;
            case ']':
                // [CODE OMITTED]
                // The only return statement in this method
                // is in this case
                break;
            default:
                // [CODE OMITTED]
                break;
        }
        node = range(bits);

        // [CODE OMITTED]
        ch = peek();
    }
}

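The code basically reads the input (the input String converted to a null-terminated int[] of code points) until it hits ] or the end of the String (an unclosed character class).

The code is a bit confusing, with continue and break mixed together inside the switch block. However, as long as you realize that continue belongs to the outer for loop and break belongs to the switch block, the code is easy to understand:

Cases ending in continue never execute the code after the switch statement.
Cases ending in break may execute the code after the switch statement (if they have not returned already).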
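The continue/break distinction can be seen in isolation with a tiny sketch that has the same shape as the parser loop: a switch inside a for loop, followed by code that only some cases reach. The class name SwitchFlowDemo and the method countAfterSwitch are made up for illustration.

```java
class SwitchFlowDemo {
    /** Counts how often the code after the switch runs while scanning input. */
    static int countAfterSwitch(String input) {
        int afterSwitch = 0;
        for (int i = 0; i < input.length(); i++) {
            switch (input.charAt(i)) {
                case 'c':
                    continue; // belongs to the for loop: skips the code after the switch
                default:
                    break;    // belongs to the switch: falls through to the code below
            }
            afterSwitch++;    // reached only by cases that end in break
        }
        return afterSwitch;
    }

    public static void main(String[] args) {
        System.out.println(countAfterSwitch("abcabc")); // 4
    }
}
```

Only the four characters that leave the switch through break reach the counter; the two 'c' characters jump straight to the next iteration.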

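With the observation above, we can see that whenever a character is found to be non-special and should be included in the character class, we execute the code after the switch statement, in which node = range(bits); is the first statement.

If you check the source code, the method CharProperty range(BitClass bits) parses "a single character or a character range in a character class". The method either returns the same BitClass object that was passed in (with the new character added) or returns a new instance of the CharProperty class.

The gory details

Next, let us look at the full version of the code (with the part that parses the character class intersection && omitted):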

private CharProperty clazz(boolean consume) {
    CharProperty prev = null;
    CharProperty node = null;
    BitClass bits = new BitClass();
    boolean include = true;
    boolean firstInClass = true;
    int ch = next();
    for (;;) {
        switch (ch) {
            case '^':
                // Negates if first char in a class, otherwise literal
                if (firstInClass) {
                    if (temp[cursor-1] != '[')
                        break;
                    ch = next();
                    include = !include;
                    continue;
                } else {
                    // ^ not first in class, treat as literal
                    break;
                }
            case '[':
                firstInClass = false;
                node = clazz(true);
                if (prev == null)
                    prev = node;
                else
                    prev = union(prev, node);
                ch = peek();
                continue;
            case '&':
                // [CODE OMITTED]
                // There are interesting things (bugs) here,
                // but it is not relevant to the discussion.
                continue;
            case 0:
                firstInClass = false;
                if (cursor >= patternLength)
                    throw error("Unclosed character class");
                break;
            case ']':
                firstInClass = false;

                if (prev != null) {
                    if (consume)
                        next();

                    return prev;
                }
                break;
            default:
                firstInClass = false;
                break;
        }
        node = range(bits);

        if (include) {
            if (prev == null) {
                prev = node;
            } else {
                if (prev != node)
                    prev = union(prev, node);
            }
        } else {
            if (prev == null) {
                prev = node.complement();
            } else {
                if (prev != node)
                    prev = setDifference(prev, node);
            }
        }
        ch = peek();
    }
}

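Looking at the code in case '[': of the switch statement and the code after the switch statement:

The node variable stores the result of parsing a unit (a standalone character, a character range, a shorthand character class, a POSIX/Unicode character class or a nested character class).
The prev variable stores the compilation result so far, and is always updated right after we compile a unit into node.

Since the local variable boolean include, which records whether the character class is negated, is never passed to any method call, it can only be acted upon inside this method. And the only place where include is read and processed is after the switch statement.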
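To see what this observation leads to, here is a toy re-creation of the control flow above. It is a sketch, not the JDK source: it handles only literal characters, a-b ranges, a leading ^ and nested classes, does no error handling, and represents character classes as IntPredicate values. The names ToyClassParser, parse and accepted are hypothetical. As in the listing, a nested class is merged with a plain union without consulting include, while literals and ranges go through the include/complement/setDifference logic.

```java
import java.util.function.IntPredicate;

class ToyClassParser {
    private final String src;
    private int pos;

    private ToyClassParser(String src) { this.src = src; }

    /** Parses a well-formed pattern consisting of exactly one bracketed class. */
    static IntPredicate parse(String pattern) {
        ToyClassParser p = new ToyClassParser(pattern);
        p.pos = 1; // skip the opening '['
        return p.clazz();
    }

    private IntPredicate clazz() {
        IntPredicate prev = null;
        boolean include = true;
        for (;;) {
            char ch = src.charAt(pos);
            if (ch == '^' && src.charAt(pos - 1) == '[') {
                include = !include; // negation marker directly after '['
                pos++;
                continue;
            }
            if (ch == '[') {
                pos++;
                IntPredicate node = clazz();
                // Mirrors case '[': the nested class is unioned in, include is ignored.
                prev = (prev == null) ? node : prev.or(node);
                continue;
            }
            if (ch == ']' && prev != null) {
                pos++;
                return prev;
            }
            IntPredicate node = range();
            if (include) {
                prev = (prev == null) ? node : prev.or(node);
            } else {
                // complement for the first unit, set difference for later ones
                prev = (prev == null) ? node.negate() : prev.and(node.negate());
            }
        }
    }

    /** Parses a single character or an a-b range. */
    private IntPredicate range() {
        final int lo = src.charAt(pos++);
        if (src.charAt(pos) == '-' && src.charAt(pos + 1) != ']') {
            final int hi = src.charAt(pos + 1);
            pos += 2;
            return c -> lo <= c && c <= hi;
        }
        return c -> c == lo;
    }

    /** Returns the characters of candidates accepted by the parsed class. */
    static String accepted(String pattern, String candidates) {
        IntPredicate p = parse(pattern);
        StringBuilder sb = new StringBuilder();
        for (char c : candidates.toCharArray()) {
            if (p.test(c)) sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(accepted("[013-9]", "x123^[]"));    // 13
        System.out.println(accepted("[^[^0-9]2]", "x123^[]")); // x^[]
        System.out.println(accepted("[^2[^0-9]]", "x123^[]")); // x13^[]
    }
}
```

Under these simplifications the toy parser accepts the same characters as the [^[^0-9]2] and [^2[^0-9]] columns of the table in the question, which is consistent with include being applied only to units handled after the switch statement.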

(Under construction)