Question

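I'm working on a Java project where I need to split strings that contain nested quoted strings.

For a plain-text input string like the following:

This is "a string" and this is "a \"nested\" string"

The result must be as follows: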

[0] This
[1] is
[2] "a string"
[3] and
[4] this
[5] is
[6] "a \"nested\" string"

Note that I want to keep the \" sequences. I have the following method:

public static String[] splitKeepingQuotationMarks(String s);

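I need to create a string array from the given s parameter according to the rules above, without using the Java Collection Framework or anything derived from it.

I'm not sure how to approach this. Can it be solved with a regular expression?

Update based on questions from the comments:

Every unescaped " has a closing unescaped " (the quotes are balanced).
Every escape character \ must itself be escaped if we want a literal that represents it (to create a literal representing \, we need to write it as \\).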

Answer 1

You can use the following regular expression:

"[^"\\]*(?:\\.[^"\\]*)*"|\S+

See the regex demo.

Java demo:

// Needs java.util.regex.Pattern and java.util.regex.Matcher.
String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group(0)); // prints one token per line
}

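Explanation:

"[^"\\]*(?:\\.[^"\\]*)*" - a double quote, followed by zero or more characters other than " and \ ([^"\\]*), followed by zero or more repetitions of an escape sequence (\\.) that is itself followed by zero or more characters other than " and \, and finally the closing double quote
| - or...
\S+ - one or more non-whitespace characters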
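
The same breakdown can be written into the pattern itself. The following is a minimal sketch, assuming Java's Pattern.COMMENTS flag is acceptable in your code base; the class name CommentedPattern and the field name TOKEN are illustrative and not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CommentedPattern {
    // Pattern.COMMENTS makes the engine ignore whitespace and '#' comments inside the pattern.
    static final Pattern TOKEN = Pattern.compile(
            "\"                  # opening double quote\n"
          + "[^\"\\\\]*          # run of characters that are neither a quote nor a backslash\n"
          + "(?:                 # then, zero or more times:\n"
          + "  \\\\.             #   a backslash followed by any one character (an escape)\n"
          + "  [^\"\\\\]*        #   another run of ordinary characters\n"
          + ")*\n"
          + "\"                  # closing double quote\n"
          + "|                   # or\n"
          + "\\S+                # a run of non-whitespace characters\n",
            Pattern.COMMENTS);

    public static void main(String[] args) {
        Matcher m = TOKEN.matcher("This is \"a string\" and this is \"a \\\"nested\\\" string\"");
        while (m.find()) {
            System.out.println(m.group()); // one token per line
        }
    }
}
```

The quoted alternative is listed first, so at a position that starts with " the engine tries to consume a whole quoted token before falling back to \S+.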

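Note

@Pshemo's suggestion, "\"(?:\\\\.|[^\"])*\"|\\S+" (or, more correctly, "\"(?:\\\\.|[^\"\\\\])*\"|\\S+"), is the same expression but much less efficient, because it uses an alternation group quantified with *. That construct involves far more backtracking, since the regex engine has to try the two alternatives at every position. My unroll-the-loop version matches whole chunks of text at once, so it is faster and more reliable.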

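Update

Since a String[] is required as output, you need to do it in two passes: count the matches, create the array, and then run the matcher again to fill it: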

int cnt = 0;
String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
    cnt++;
}
System.out.println(cnt);
String[] result = new String[cnt];
matcher.reset();
int idx = 0;
while (matcher.find()) {
    result[idx] = matcher.group(0);
    idx++;
}
System.out.println(Arrays.toString(result));

See another IDEONE demo.
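
For convenience, the two-pass idea can be wrapped in the method signature from the question. This is only a sketch under that assumption; the class name RegexSplitter and the constant TOKEN are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexSplitter {
    // Same pattern as above: a quoted run with backslash escapes, or a run of non-whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");

    public static String[] splitKeepingQuotationMarks(String s) {
        Matcher matcher = TOKEN.matcher(s);

        // First pass: count the matches so the array can be sized exactly.
        int count = 0;
        while (matcher.find()) {
            count++;
        }

        // Second pass: rewind the matcher and copy each match into the array.
        String[] result = new String[count];
        matcher.reset();
        for (int i = 0; matcher.find(); i++) {
            result[i] = matcher.group();
        }
        return result;
    }

    public static void main(String[] args) {
        String input = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        for (String token : splitKeepingQuotationMarks(input)) {
            System.out.println(token);
        }
    }
}
```

Only java.util.regex and a plain array are involved, so nothing from the Collection Framework is needed.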

Answer 2

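Another regex approach that works uses a negative lookbehind: match either a "word" (\w+) or "a quote followed by everything up to the next quote that isn't preceded by a backslash", and make the match "global" (don't return on the first match).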

(\w+|".*?(?<!\\)")

see it here
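
One thing worth checking before relying on this pattern: the lookbehind only inspects the single character before a quote, so a quoted token that ends in an escaped backslash (\\") is not closed there, and \w+ only covers word characters. The sketch below compares it with the unrolled pattern from Answer 1; the class name LookbehindComparison, the helper tokens, and the "tricky" input are illustrative assumptions, not part of either answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookbehindComparison {
    // The lookbehind pattern from this answer.
    static final Pattern LOOKBEHIND = Pattern.compile("(\\w+|\".*?(?<!\\\\)\")");
    // The unrolled pattern from Answer 1, for comparison.
    static final Pattern UNROLLED =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");

    // Collects all matches into an array by counting first, then filling.
    static String[] tokens(Pattern p, String s) {
        Matcher m = p.matcher(s);
        int n = 0;
        while (m.find()) n++;
        String[] out = new String[n];
        m.reset();
        for (int i = 0; m.find(); i++) out[i] = m.group();
        return out;
    }

    public static void main(String[] args) {
        String sample = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        String tricky = "say \"C:\\\\\" now"; // the text: say "C:\\" now
        for (String input : new String[] {sample, tricky}) {
            System.out.println(String.join(" | ", tokens(LOOKBEHIND, input)));
            System.out.println(String.join(" | ", tokens(UNROLLED, input)));
        }
    }
}
```

On the sample input from the question both patterns produce the same seven tokens; they diverge on the second input, where the closing quote follows an escaped backslash.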

Answer 3

An alternative approach that doesn't use a regular expression:

import java.util.ArrayList;
import java.util.Arrays;

public class SplitKeepingQuotationMarks {
    public static void main(String[] args) {
        String pattern = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        System.out.println(Arrays.toString(splitKeepingQuotationMarks(pattern)));
    }

    public static String[] splitKeepingQuotationMarks(String s) {
        ArrayList<String> results = new ArrayList<>();
        StringBuilder last = new StringBuilder();
        boolean inString = false;
        boolean wasBackSlash = false;

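        for (char c : s.toCharArray()) {
            if (wasBackSlash) {
                // The previous character was an escaping backslash: keep this one verbatim.
                last.append(c);
                wasBackSlash = false;
            } else if (c == '\\') {
                wasBackSlash = true;
                last.append(c);
            } else if (c == '"') {
                // An unescaped quote opens or closes a quoted section.
                last.append(c);
                inString = !inString;
            } else if (Character.isSpaceChar(c) && !inString) {
                if (last.length() > 0) {
                    results.add(last.toString());
                    last.setLength(0); // Clears the StringBuilder.
                }
            } else {
                last.append(c);
            }
        }

        // Add the trailing token, if there is one.
        if (last.length() > 0) {
            results.add(last.toString());
        }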
        return results.toArray(new String[results.size()]);
    }
}

Output:

[This, is, "a string", and, this, is, "a \"nested\" string"]
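
The code above uses ArrayList, which the question rules out. A possible way around that, sketched here under the assumption that scanning the input twice is acceptable, is to count the tokens in a first pass and fill a plain array in a second. The class name ManualSplitter and the helper scan are hypothetical names chosen for this example.

```java
class ManualSplitter {
    public static String[] splitKeepingQuotationMarks(String s) {
        // Pass 1 counts the tokens; pass 2 fills an array of exactly that size.
        String[] result = new String[scan(s, null)];
        scan(s, result);
        return result;
    }

    // Walks the input once. If out is non-null, each token is stored in it.
    // Returns the number of tokens found.
    private static int scan(String s, String[] out) {
        int count = 0;
        int start = -1;            // index where the current token began, or -1 between tokens
        boolean inString = false;  // currently inside an unescaped pair of quotes
        boolean escaped = false;   // previous character was an escaping backslash

        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!inString && !escaped && Character.isWhitespace(c)) {
                // Whitespace outside quotes ends the current token, if any.
                if (start >= 0) {
                    if (out != null) out[count] = s.substring(start, i);
                    count++;
                    start = -1;
                }
                continue;
            }
            if (start < 0) start = i;
            if (escaped) {
                escaped = false;       // this character was escaped; take it literally
            } else if (c == '\\') {
                escaped = true;
            } else if (c == '"') {
                inString = !inString;  // unescaped quote toggles the quoted state
            }
        }
        if (start >= 0) {
            // Trailing token that runs to the end of the input.
            if (out != null) out[count] = s.substring(start);
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        for (String token : splitKeepingQuotationMarks(input)) {
            System.out.println(token);
        }
    }
}
```

Tokens are taken as substrings of the input, so no StringBuilder or collection is needed; the trade-off is that the input is scanned twice.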

Split nested strings, keeping the quotation marks
