Question

I am trying to tokenize the words in arbitrary text. For example:

Ça me plaît.

should be tokenized as "ça, me, plaît". To do this, I wanted to strip all special characters from the string and then split it on spaces. Using this code:

text = text.toLowerCase().replaceAll(/[^\w]/, ' ')
def tokens = text.split(" ")

I get

a me pla t

which is far from useful. What regular expression do I need?

Thanks! Mulone

Answer 1

This seems to work for me (at least in this case):

'Ça me plaît.'.toLowerCase().replaceAll( /[^\p{javaLowerCase}]/, ' ').split( ' ' )
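
For context, the attempt in the question fails because Java's \w (which Groovy uses) matches only the ASCII set [a-zA-Z_0-9] by default, so ç and î are treated as special characters and replaced. As a rough sketch of a slightly more general variant, the snippet below splits on runs of characters that are not Unicode letters, using \p{L}. The helper name tokenize is made up for illustration and is not part of the original post; it also assumes the accented letters are precomposed characters rather than a base letter plus a combining mark.

```groovy
// Hypothetical helper (not from the original post).
// Lower-cases the text, then splits on runs of non-letter characters.
// \p{L} matches any Unicode letter, including ç and î.
List<String> tokenize(String text) {
    def parts = text.toLowerCase().split(/[^\p{L}]+/).toList()
    // Leading punctuation would produce an empty first token, so drop empties.
    return parts.findAll { it }
}

println tokenize('Ça me plaît.')   // [ça, me, plaît]
```

Splitting on a run ([^\p{L}]+) instead of replacing each character with a single space avoids empty tokens when several separators appear in a row, such as a comma followed by a space.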

Answer 2

You can use \S (uppercase S) instead of \w. \S matches any non-whitespace character, while \s (lowercase) matches any whitespace character.

So you would have

text = text.toLowerCase().replaceAll(/[^\S]/, ' ')
def tokens = text.split(" ")
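
One thing to keep in mind with a whitespace-only approach is that punctuation stays attached to the neighbouring word. The comparison below is only a sketch, assuming Java 7 or later for the embedded (?U) flag (UNICODE_CHARACTER_CLASS), which makes \W Unicode-aware so that accented letters count as word characters. The variable names are illustrative.

```groovy
def text = 'Ça me plaît.'.toLowerCase()

// Splitting on whitespace only: the period stays attached to the last token.
def bySpace = text.split(/\s+/).toList()
println bySpace   // [ça, me, plaît.]

// (?U) turns on UNICODE_CHARACTER_CLASS, so \W no longer matches ç or î.
// Punctuation is replaced by a space, then the result is trimmed and split.
def byWord = text.replaceAll(/(?U)\W/, ' ').trim().split(/\s+/).toList()
println byWord    // [ça, me, plaît]
```

Note that with (?U), \w also covers digits and underscores, so this variant keeps tokens such as "2nd" intact, which may or may not be what a word tokenizer should do.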