Question

Before Java 8, when we split on an empty string, as in

String[] tokens = "abc".split("");

the split mechanism would split at the places marked with |

|a|b|c|

because an empty string "" exists before and after each character. As a result, it would first generate this array

["", "a", "b", "c", ""]

and would later remove trailing empty strings (because we didn't explicitly provide a negative value for the limit argument), so it would finally return

["", "a", "b", "c"]
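The trailing-empty-string rule described above does not depend on the Java version, so it can be checked in isolation. The following is a minimal sketch; the comma-separated input and the class and method names are illustrative assumptions, not taken from the question.

```java
import java.util.Arrays;

public class TrailingEmptyDemo {
    // limit == 0 (the default for the one-argument split): trailing empty strings are removed
    static String[] splitDefault(String s) {
        return s.split(",");
    }

    // negative limit: trailing empty strings are kept
    static String[] splitKeepTrailing(String s) {
        return s.split(",", -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitDefault("a,b,,")));      // prints [a, b]
        System.out.println(Arrays.toString(splitKeepTrailing("a,b,,"))); // prints [a, b, , ]
    }
}
```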

In Java 8 the split mechanism seems to have changed. Now when we use

"abc".split("")

we get the array ["a", "b", "c"] instead of ["", "a", "b", "c"], so it looks like empty strings at the start are removed as well. But this theory fails because, for instance,

"abc".split("a")

returns an array with an empty string at the start: ["", "bc"].

Can someone explain what is going on here, and how the rules of split for these cases changed in Java 8?

Answer 1

The behavior of String.split (which calls Pattern.split) changed between Java 7 and Java 8.

Documentation

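Comparing the documentation of Pattern.split in Java 7 and Java 8, we see that the following clause was added:

When there is a positive-width match at the beginning of the input sequence then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

The same clause was also added to String.split in Java 8, compared to Java 7.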

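Reference implementation

Let's compare the code of Pattern.split in the reference implementations of Java 7 and Java 8. The code was retrieved from grepcode, for versions 7u40-b43 and 8-b132.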

Java 7

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

Java 8

public String[] split(CharSequence input, int limit) {
    int index = 0;
    boolean matchLimited = limit > 0;
    ArrayList<String> matchList = new ArrayList<>();
    Matcher m = matcher(input);

    // Add segments before each match found
    while(m.find()) {
        if (!matchLimited || matchList.size() < limit - 1) {
            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
            String match = input.subSequence(index, m.start()).toString();
            matchList.add(match);
            index = m.end();
        } else if (matchList.size() == limit - 1) { // last one
            String match = input.subSequence(index,
                                             input.length()).toString();
            matchList.add(match);
            index = m.end();
        }
    }

    // If no match was found, return this
    if (index == 0)
        return new String[] {input.toString()};

    // Add remaining segment
    if (!matchLimited || matchList.size() < limit)
        matchList.add(input.subSequence(index, input.length()).toString());

    // Construct result
    int resultSize = matchList.size();
    if (limit == 0)
        while (resultSize > 0 && matchList.get(resultSize-1).equals(""))
            resultSize--;
    String[] result = new String[resultSize];
    return matchList.subList(0, resultSize).toArray(result);
}

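The addition of the following code in Java 8 excludes a zero-length match at the beginning of the input string, which explains the behavior above.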

            if (index == 0 && index == m.start() && m.start() == m.end()) {
                // no empty leading substring included for zero-width match
                // at the beginning of the input char sequence.
                continue;
            }
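To see which matches this check affects, the following sketch lists every match the regex engine finds and marks the one that satisfies the condition. It is only an illustration of the check under the default limit of 0: the class name, the method name, and the "start-end" output format are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LeadingMatchTrace {
    // Lists each match as "start-end" and marks the match that the Java 8 check skips.
    static List<String> trace(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int index = 0; // mirrors the 'index' variable of the reference code
        while (m.find()) {
            // Same condition as the check added in Java 8
            boolean skipped = index == 0 && index == m.start() && m.start() == m.end();
            out.add(m.start() + "-" + m.end() + (skipped ? " skipped" : ""));
            if (!skipped) {
                index = m.end();
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(trace("", "abc"));  // prints [0-0 skipped, 1-1, 2-2, 3-3]
        System.out.println(trace("a", "abc")); // prints [0-1]
    }
}
```

Only the zero-width match at index 0 is skipped. The match of "a" also starts at index 0, but it has positive width, so it still produces the empty leading substring.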

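Maintaining compatibility

Following the behavior in Java 8 and above

To make split behave consistently across versions and compatible with the behavior in Java 8:

1. If your regex can match a zero-length string, add (?!\A) at the end of the regex and, if necessary, wrap the original regex in a non-capturing group (?:...).
2. If your regex cannot match a zero-length string, you don't need to do anything.
3. If you don't know whether the regex can match a zero-length string, do both actions in step 1.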

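(?!\A) asserts that the match does not end at the beginning of the string. A match that ends there can only be an empty match at the beginning of the string, so the assertion rejects exactly that case.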
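The rewrite can be sketched as a small helper. The helper name excludeLeadingEmptyMatch and the sample regexes are assumptions made for this example, and the expected outputs in the comments are for Java 8 and above.

```java
import java.util.Arrays;

public class SplitCompat {
    // Hypothetical helper: wraps the regex in a non-capturing group and appends (?!\A),
    // so a match can no longer end at the very beginning of the input.
    static String excludeLeadingEmptyMatch(String regex) {
        return "(?:" + regex + ")(?!\\A)";
    }

    public static void main(String[] args) {
        // Regexes that can match the empty string: no match is possible at index 0 any more
        System.out.println(Arrays.toString("abc".split(excludeLeadingEmptyMatch(""))));   // prints [a, b, c]
        System.out.println(Arrays.toString("abc".split(excludeLeadingEmptyMatch("x*")))); // prints [a, b, c]
        // Positive-width match at the start: unaffected, the empty leading string is kept
        System.out.println(Arrays.toString("abc".split(excludeLeadingEmptyMatch("a"))));  // prints [, bc]
    }
}
```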

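Following the behavior in Java 7 and prior

There is no general solution to make split backward compatible with Java 7 and prior, short of replacing all instances of split with calls to your own custom implementation.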
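One possible shape for such a custom implementation is sketched below. It follows the logic of the Java 7 listing above, without the Java 8 check; the class name LegacySplit and the method name splitJava7 are hypothetical, and this is an illustration rather than a drop-in replacement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacySplit {
    // Follows the Java 7 logic: no special case for a zero-width match at index 0.
    static String[] splitJava7(Pattern pattern, CharSequence input, int limit) {
        int index = 0;
        boolean matchLimited = limit > 0;
        List<String> matchList = new ArrayList<>();
        Matcher m = pattern.matcher(input);

        // Add the segment before each match found
        while (m.find()) {
            if (!matchLimited || matchList.size() < limit - 1) {
                matchList.add(input.subSequence(index, m.start()).toString());
                index = m.end();
            } else if (matchList.size() == limit - 1) { // last allowed segment
                matchList.add(input.subSequence(index, input.length()).toString());
                index = m.end();
            }
        }

        // No match advanced the index: return the input unchanged
        if (index == 0) {
            return new String[] {input.toString()};
        }

        // Add the remaining segment
        if (!matchLimited || matchList.size() < limit) {
            matchList.add(input.subSequence(index, input.length()).toString());
        }

        // With limit == 0, drop trailing empty strings
        int resultSize = matchList.size();
        if (limit == 0) {
            while (resultSize > 0 && matchList.get(resultSize - 1).isEmpty()) {
                resultSize--;
            }
        }
        return matchList.subList(0, resultSize).toArray(new String[0]);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitJava7(Pattern.compile(""), "abc", 0)));  // prints [, a, b, c]
        System.out.println(Arrays.toString(splitJava7(Pattern.compile("a"), "abc", 0))); // prints [, bc]
    }
}
```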

Answer 2

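This is specified in the documentation of split(String regex, int limit).

When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

In "abc".split(""), you get a zero-width match at the beginning, so the leading empty substring is not included in the resulting array.

However, in your second snippet, where you split on "a", you get a positive-width match (of width 1 in this case), so the empty leading substring is included as expected.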
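The distinction between the two cases can be checked by measuring the width of the match at the start of the input. This is a minimal sketch: the class name, the method name firstMatchWidth, and the extra lookahead regex (?=a) are assumptions added for illustration, and the split results in the comments are for Java 8 and above.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitWidthDemo {
    // Width of the match at the very start of the input, or -1 if the regex does not match there.
    static int firstMatchWidth(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.end() - m.start() : -1;
    }

    public static void main(String[] args) {
        // Zero-width match at the start: no empty leading substring
        System.out.println(firstMatchWidth("", "abc") + " " + Arrays.toString("abc".split("")));           // prints 0 [a, b, c]
        // Positive-width match at the start: empty leading substring is included
        System.out.println(firstMatchWidth("a", "abc") + " " + Arrays.toString("abc".split("a")));         // prints 1 [, bc]
        // A lookahead is zero-width too, so the input comes back unsplit
        System.out.println(firstMatchWidth("(?=a)", "abc") + " " + Arrays.toString("abc".split("(?=a)"))); // prints 0 [abc]
    }
}
```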

(Irrelevant source code removed)

Answer 3

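The documentation of split() changed slightly from Java 7 to Java 8. Specifically, the following statement was added:

When there is a positive-width match at the beginning of this string then an empty leading substring is included at the beginning of the resulting array. A zero-width match at the beginning however never produces such empty leading substring.

(emphasis mine)

Splitting on the empty string generates a zero-width match at the beginning, so according to what is specified above, an empty string is not included at the start of the resulting array. By contrast, your second example, which splits on "a", generates a positive-width match at the start of the string, so an empty string is in fact included at the start of the resulting array.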

Why does split in Java 8 sometimes remove empty strings at the start of the result array?
