Question

I have some strings, for example: I: am a string, with "punctuation". I want to split the string into:

["I", ":", "am", "a", "string", ",", "with", "\"", "punctuation", "\"", "."]

I tried text.split("[\\p{Punct}\\s]+"), but the result is I, am, a, string, with, punctuation ...

I found this solution, but Java does not allow me to split by \w.

Answer 1

Use this regular expression:

"\\s+|(?=\\p{Punct})|(?<=\\p{Punct})"

Result for your string:

["I", ":", "am", "a", "string", ",", "with", "", "\"", "punctuation", "\"", "."]

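Unfortunately, there is one extra element: the "" after "with". Such extra elements appear only (and always) when a punctuation character follows a whitespace character, so this can be fixed by calling myString.replaceAll("\\s+(?=\\p{Punct})", "").split(regex); instead of myString.split(regex); (that is, by stripping that whitespace before splitting).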
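To make the difference concrete, here is a minimal sketch that runs both variants on the string from the question. The class and method names (SplitDemo, splitRaw, splitClean) are made up for illustration, and the behavior described assumes the String.split semantics of Java 8 or later.

```java
import java.util.Arrays;

class SplitDemo {
    // The pattern from the answer: split on whitespace runs, and at the
    // zero-width positions just before and just after a punctuation character.
    static final String REGEX = "\\s+|(?=\\p{Punct})|(?<=\\p{Punct})";

    // Plain split: yields an empty token where whitespace precedes punctuation.
    static String[] splitRaw(String text) {
        return text.split(REGEX);
    }

    // Workaround from the answer: remove whitespace that precedes punctuation first.
    static String[] splitClean(String text) {
        return text.replaceAll("\\s+(?=\\p{Punct})", "").split(REGEX);
    }

    public static void main(String[] args) {
        String s = "I: am a string, with \"punctuation\".";
        // 12 tokens, including the empty one after "with"
        System.out.println(Arrays.toString(splitRaw(s)));
        // 11 tokens, no empty one
        System.out.println(Arrays.toString(splitClean(s)));
    }
}
```

Note that the replaceAll step only removes whitespace directly in front of punctuation, so ordinary spaces between words are still handled by the \\s+ branch of the split.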

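How this works:

\\s+ splits on a group of whitespace, so if the characters are whitespace characters, we remove those characters and split at that location. (Note: I am assuming that a string such as hello  world, with more than one space between the words, should result in ["hello", "world"] rather than ["hello", "", "world"].)
(?=\\p{Punct}) is a lookahead: it splits if the next character is a punctuation character, but it does not remove that character.
(?<=\\p{Punct}) is a lookbehind: it splits if the previous character is a punctuation character, again without removing it.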
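The three pieces can be tried in isolation. The sketch below is only an illustration on tiny inputs; the class name LookaroundSplit and the helper show are hypothetical, and it assumes Java 8 or later.

```java
import java.util.Arrays;

class LookaroundSplit {
    // Helper for printing the result of a split.
    static String show(String text, String regex) {
        return Arrays.toString(text.split(regex));
    }

    public static void main(String[] args) {
        // \s+ consumes the whole run of spaces: tokens "hello", "world"
        System.out.println(show("hello  world", "\\s+"));
        // \s splits at every single space: tokens "hello", "", "world"
        System.out.println(show("hello  world", "\\s"));
        // Lookahead splits before the comma and keeps it: tokens "a", ",b"
        System.out.println(show("a,b", "(?=\\p{Punct})"));
        // Lookbehind splits after the comma and keeps it: tokens "a,", "b"
        System.out.println(show("a,b", "(?<=\\p{Punct})"));
        // Both together isolate the comma: tokens "a", ",", "b"
        System.out.println(show("a,b", "(?=\\p{Punct})|(?<=\\p{Punct})"));
    }
}
```

Because lookarounds match a position rather than a character, nothing is consumed by them, which is why the punctuation survives as its own token.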

Edit:

In response to your comment, this regular expression should allow punctuation inside words:

"\\s+|(?=\\W\\p{Punct}|\\p{Punct}\\W)|(?<=\\W\\p{Punct}|\\p{Punct}\\W)"

With this one you should not need replaceAll; just call myString.split(regex).

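How it works:

This regular expression is very similar, but the lookarounds have changed. \\W\\p{Punct} matches a non-word character followed by a punctuation character. \\p{Punct}\\W matches a punctuation character followed by a non-word character. So each lookaround matches only when there is a punctuation character that is not in the middle of a word.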
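Lookaround-based splits are easy to get subtly wrong, so it is worth testing the pattern against your own inputs. As an alternative sketch of the same goal (this is not the regex above, but a different technique: matching tokens instead of splitting), the following keeps punctuation that sits between word characters inside the word. The class name InWordTokenizer and the pattern are my own assumptions, not part of the original answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InWordTokenizer {
    // A token is either a word that may contain punctuation between word
    // characters (e.g. isn't, well-known), or any single non-whitespace character.
    private static final Pattern TOKEN =
            Pattern.compile("\\w+(?:\\p{Punct}\\w+)*|\\S");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Tokens: It's, a, well-known, fact, ",", isn't, it, ?
        System.out.println(tokenize("It's a well-known fact, isn't it?"));
        // Punctuation at word boundaries is still separated.
        System.out.println(tokenize("I: am a string, with \"punctuation\"."));
    }
}
```

Matching tokens never produces empty strings, so no cleanup step is needed afterwards.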

Answer 2

Or try this, collecting the matches in an ArrayList:

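    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class Tokenize {
        public static void main(String[] args) {
            String s = "I: am a string, with \"punctuation\".";
            // A token is a run of word characters or any single non-whitespace character.
            Pattern pat = Pattern.compile("\\w+|\\S");
            Matcher mat = pat.matcher(s);
            while (mat.find()) {
                System.out.print(mat.group() + "/");
            }
            System.out.println();
        }
    }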

Output:

    I/:/am/a/string/,/with/"/punctuation/"/./
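The snippet above prints the tokens rather than storing them. A minimal sketch of the collecting variant could look like this; the class name TokenCollector and the method tokenize are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TokenCollector {
    // Same pattern as in the answer: a run of word characters,
    // or any single non-whitespace character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|\\S");

    // Collects every match into an ArrayList, in order of appearance.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher mat = TOKEN.matcher(text);
        while (mat.find()) {
            tokens.add(mat.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("I: am a string, with \"punctuation\".");
        System.out.println(tokens.size()); // 11
        System.out.println(tokens);
    }
}
```

Since whitespace never matches either alternative, it is simply skipped between tokens and no empty strings end up in the list.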

Splitting a string by punctuation and whitespace
