Question

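When using split(), what regular expression would let me keep all word characters but also preserve contractions such as don't? That is, keep an apostrophe that has word characters on both sides, but remove any leading or trailing apostrophe, as in 'tis or dogs'.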

I have:

String [] words = line.split("[^\\w'+]+[\\w+('*?)\\w+]");

but it keeps the leading and trailing punctuation.

The input 'Tis the season, for the children's happiness'

should produce the output: Tis the season for the children's happiness

Any suggestions?

Answer 1

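I would split on:

either an apostrophe (or hyphen) followed by at least one non-word character, ['-]\\W+,

or any character that is not a word character, apostrophe, or hyphen, followed by any further non-word characters, [^\\w'-]\\W*.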

// Needs: import java.util.Arrays;
String line = "'Tis the season, for the children's happiness'";
String[] words = line.split("(['-]\\W+|[^\\w'-]\\W*)");
System.out.println(Arrays.toString(words));

Here I added the hyphen (-) alongside the apostrophe, so hyphenated words are kept whole as well.

Result:

['Tis, the, season, for, the, children's, happiness']

With the start and end of the string handled as well:

    String[] words = line.split("(^['-]|['-]$|['-]\\W+|[^\\w'-]\\W*)");

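Result:

[, Tis, the, season, for, the, children's, happiness]

The match at the very start produces an empty first string: split() keeps an empty leading element when a non-empty match occurs at the beginning of the input, whereas trailing empty strings are dropped.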
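One way to get rid of that empty first element is to filter it out after splitting. The following is a minimal sketch, not part of the original answer; the class name `ContractionSplitter` and the method `words` are hypothetical names chosen for illustration, and the separator pattern is the one shown above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class ContractionSplitter {
    // Separator pattern from the answer above, including the start/end alternatives.
    private static final String SEPARATOR = "(^['-]|['-]$|['-]\\W+|[^\\w'-]\\W*)";

    // Splits the line and drops any empty strings left over by split().
    static List<String> words(String line) {
        return Arrays.stream(line.split(SEPARATOR))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // prints [Tis, the, season, for, the, children's, happiness]
        System.out.println(words("'Tis the season, for the children's happiness'"));
    }
}
```

Filtering after the split keeps the regular expression unchanged; stripping the leading apostrophe before splitting would be another option.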

Answer 2

Alternatively, you could just match the pattern:

\w+('\w+)?
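In Java this means using a Matcher and collecting each match instead of calling split(). A minimal sketch, assuming the pattern above with backslashes doubled for a string literal; the class name `ContractionMatcher` and the method `words` are hypothetical. Note that \w also matches digits and underscores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContractionMatcher {
    // A run of word characters, optionally followed by an apostrophe and more word characters.
    private static final Pattern WORD = Pattern.compile("\\w+('\\w+)?");

    // Collects every match of WORD in the line, in order.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(line);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [Tis, the, season, for, the, children's, happiness]
        System.out.println(words("'Tis the season, for the children's happiness'"));
    }
}
```

Because only the words themselves are matched, leading and trailing apostrophes never end up in the result and no empty strings need filtering.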

Answer 3

English is awful. Consider the following Cockney:

"Jane said, ''E'll be spooky, Sam's son with the jack-o'-lantern!'," said the twins' ghosts in unison.

All the words are matched using:

('?[\p{L}](-[^-])?('-)?(s'(?=\s))?)+

which returns 16 matches:

Jane, said, 'E'll, be, spooky, Sam's, son, with, the, jack-o'-lantern, said, the, twins', ghosts, in, unison

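Note that twins' is a possessive rather than a contraction, and it is matched. Sam's, however, is also a possessive but cannot be distinguished from a contraction. Handling that would take a carefully crafted exception clause, because it's is not the possessive of it: that is its.

This will not include the apostrophe in happiness', since there is no easy way to tell whether it is a closing single quote or a possessive.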
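The pattern can be tried out in Java roughly as follows. This is a sketch under assumptions: the class name `CockneyWords` and the method `words` are hypothetical, the backslashes are doubled for a Java string literal, and the sample sentence is the Cockney example as reconstructed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CockneyWords {
    // The pattern from this answer, escaped for a Java string literal.
    private static final Pattern WORD =
            Pattern.compile("('?[\\p{L}](-[^-])?('-)?(s'(?=\\s))?)+");

    // Collects every match of WORD in the text, in order.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "\"Jane said, ''E'll be spooky, Sam's son with the "
                + "jack-o'-lantern!',\" said the twins' ghosts in unison.";
        List<String> found = words(text);
        // Prints the number of matches followed by the matches themselves.
        System.out.println(found.size() + " matches: " + found);
    }
}
```

The trailing lookahead (?=\s) is what lets twins' keep its apostrophe, while happiness' at the very end of a string loses it because no whitespace follows.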

Java regex split keeping contractions
