Split a sentence into individual words, with a full stop at the end of the sentence as its own token

Time: 2014-05-28 00:24:07

Tags: javascript regex text tokenize

I need to tokenize and process a string-based programming language.

For example, take the following string:

"      THE QUICK BROWN FOX    JUMPED-OVER THE LAZY(2) DOG." 

In JavaScript, I can do the following to split it into an array:

var v = "      THE QUICK BROWN FOX   JUMPED-OVER THE LAZY(2) DOG.".match(/\S+/g);

This results in the following array:

["THE", "QUICK", "BROWN", "FOX", "JUMPED-OVER", "THE", "LAZY(2)", "DOG."]

How can I change the regular expression in the match so that the full stop becomes a separate element, producing the following output:

["THE", "QUICK", "BROWN", "FOX", "JUMPED-OVER", "THE", "LAZY(2)", "DOG", "."]

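Please note:

  • I can't use \w, because it splits the 2 into a separate token, removes the brackets and removes the full stop.
  • This is not a duplicate question, because the other questions about splitting sentences do not deal with the full stop while also handling brackets properly.
  • If this can't be done with a regular expression, is it possible to remove the full stop from the last token, so that the last token becomes "DOG"?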
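
To make the first note concrete, here is a small sketch of what a \w-based match does to the sample string. The variable names `sample` and `wordTokens` are made up for this illustration. Note that the hyphen in JUMPED-OVER is lost as well.

```javascript
// Illustration only: `sample` and `wordTokens` are hypothetical names.
var sample = "      THE QUICK BROWN FOX   JUMPED-OVER THE LAZY(2) DOG.";

// \w matches only letters, digits and underscore, so the hyphen,
// the brackets and the full stop all act as separators and disappear.
var wordTokens = sample.match(/\w+/g);

console.log(wordTokens);
// ["THE", "QUICK", "BROWN", "FOX", "JUMPED", "OVER", "THE", "LAZY", "2", "DOG"]
```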

2 answers:

Answer 0: (score: 2)

You can write \S as the negated class [^\s] and add . to that class, like this:

/[^\s.]+/g

This gives:

"      THE QUICK BROWN FOX   JUMPED-OVER THE LAZY(2) DOG.".match(/[^\s.]+/g)
["THE", "QUICK", "BROWN", "FOX", "JUMPED-OVER", "THE", "LAZY(2)", "DOG"]

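This simply removes the period from the matches (every period, not only the last one).

To add the ending period back to the matches: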

"      THE QUICK BROWN FOX   JUMPED-OVER THE LAZY(2) DOG.".match(/[^\s.]+|\.$/g)
["THE", "QUICK", "BROWN", "FOX", "JUMPED-OVER", "THE", "LAZY(2)", "DOG", "."]
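
Two caveats with the pattern above may matter for other inputs: `[^\s.]+` also splits tokens at periods inside a word, and `\.$` only matches when the period is the very last character, so trailing whitespace hides it. The following is one possible variant, not part of the original answer. It uses lookaheads so that only a period followed by optional whitespace and the end of the string is split off. The function name `tokenize` and the input "V1.2 DOG. " are made up for this sketch.

```javascript
// Hypothetical helper, a sketch rather than a definitive tokenizer.
function tokenize(text) {
  // First alternative: a period followed only by optional whitespace up to the end.
  // Second alternative: the shortest run of non-whitespace that is followed by
  // that final period, by whitespace, or by the end of the string.
  return text.match(/\.(?=\s*$)|\S+?(?=\.\s*$|\s|$)/g) || [];
}

var original = "      THE QUICK BROWN FOX   JUMPED-OVER THE LAZY(2) DOG.";
console.log(tokenize(original));
// ["THE", "QUICK", "BROWN", "FOX", "JUMPED-OVER", "THE", "LAZY(2)", "DOG", "."]

// Made-up input with an inner period and a trailing space.
var tricky = "V1.2 DOG. ";
console.log(tricky.match(/[^\s.]+|\.$/g)); // ["V1", "2", "DOG"]
console.log(tokenize(tricky));             // ["V1.2", "DOG", "."]
```

The `|| []` is there because `match` with the g flag returns null when nothing matches.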

Answer 1: (score: 1)

Add a space before the "." and then match:

var v = "      THE QUICK BROWN FOX   JUMPED-OVER THE LAZY(2) DOG.".replace(".", " .").match(/\S+/g);

console.log(v);

Result:

["THE", "QUICK", "BROWN", "FOX", "JUMPED-OVER", "THE", "LAZY(2)", "DOG", "."]
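
One thing to keep in mind: when `replace` is given a string pattern, it replaces only the first occurrence, so this works for the sample because the only "." is the final one. If the text can contain earlier periods, a regular expression that targets just the sentence-final period may be safer. The sketch below assumes that reading of the requirement. The helper name `splitWithFinalStop` and the input "V1.2 DOG." are invented for illustration.

```javascript
// With a string pattern, only the first "." gets the space (made-up input).
var firstOnly = "V1.2 DOG.".replace(".", " .").match(/\S+/g);
console.log(firstOnly); // ["V1", ".2", "DOG."]

// Hypothetical helper: put a space before a period that is followed only by
// optional whitespace and the end of the string, then split on whitespace.
function splitWithFinalStop(text) {
  return text.replace(/\.(?=\s*$)/, " .").match(/\S+/g) || [];
}

console.log(splitWithFinalStop("V1.2 DOG.")); // ["V1.2", "DOG", "."]
```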