How to parse a string into words and punctuation using JavaScript

Time: 2014-07-12 23:54:46

Tags: javascript regex

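I have a string test = "hello how are you all doing, I hope that it's good! and fine. Looking forward to see you."

I am trying to parse the string into words and punctuation marks using JavaScript. I can separate the words, but with my regular expression the punctuation marks disappear:

var result = test.match(/\b(\w|')+\b/g);

So my expected output is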

hello
how 
are 
you
all
doing
,
I
hope
that
it's
good
!
and 
fine
.
Looking
forward
to
see
you

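2 Answers:

Answer 0: (score: 10)

The simple approach

This first approach works if JavaScript's definition of a "word" matches yours. A more customizable approach follows further down.

Try test.split(/\s*\b\s*/). It splits on word boundaries (\b) and eats the surrounding whitespace.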

"hello how are you all doing, I hope that it's good! and fine. Looking forward to see you."
    .split(/\s*\b\s*/);
// Returns:
["hello",
"how",
"are",
"you",
"all",
"doing",
",",
"I",
"hope",
"that",
"it",
"'",
"s",
"good",
"!",
"and",
"fine",
".",
"Looking",
"forward",
"to",
"see",
"you",
"."]

How it works.

var test = "This is. A test?"; // Test string.

// First consider splitting on word boundaries (\b).
test.split(/\b/); //=> ["This"," ","is",". ","A"," ","test","?"]
// This almost works but there is some unwanted whitespace.

// So we change the split regex to gobble the whitespace using \s*
test.split(/\s*\b\s*/) //=> ["This","is",".","A","test","?"]
// Now the whitespace is included in the separator
// and not included in the result.

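A more involved solution.

If you want words such as "isn't" and "one-thousand" to be treated as single words, whereas JavaScript's \b splits them at the apostrophe and the hyphen, you need to create your own definition of a word.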

test.match(/[\w-']+|[^\w\s]+/g) //=> ["This","is",".","A","test","?"]

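How it works

This uses alternation to match actual words separately from punctuation characters. The first half of the regex, [\w-']+, matches whatever you consider to be a word, and the second half, [^\w\s]+, matches whatever you consider to be punctuation. In this example I simply used anything that is neither a word character nor whitespace. I also put a + on the end so that multi-character punctuation (such as ?!) is treated as a single token; if you don't want that, remove the +.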
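To illustrate the custom word definition on input that actually contains apostrophes and hyphens, here is a minimal sketch. The helper name tokenize and the sample sentence are made up for illustration and are not part of the answer. The character class is written as [\w'-] rather than [\w-'] on the assumption that putting the hyphen last is the safer spelling: it behaves the same here, and it avoids the syntax error that [\w-'] raises when the u flag is used.

```javascript
// Hypothetical helper: the name `tokenize` is an illustration, not from the answer.
// The pattern is the answer's alternation with the hyphen moved to the end of the class.
function tokenize(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(/[\w'-]+|[^\w\s]+/g) || [];
}

const tokens = tokenize("it's one-thousand, isn't it?!");
console.log(tokens);
// ["it's", "one-thousand", ",", "isn't", "it", "?!"]

// Dropping the trailing + from the punctuation branch yields one token per mark.
const separate = "it?!".match(/[\w'-]+|[^\w\s]/g);
console.log(separate);
// ["it", "?", "!"]
```

Because the apostrophe and hyphen belong to the word class, a stray leading quote or dash attaches to the following word instead of becoming punctuation; whether that is acceptable depends on your input.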

Answer 1: (score: 4)

Use this:

[,.!?;:]|\b[a-z']+\b

See the matches in the demo.

For example, in JavaScript:

resultArray = yourString.match(/[,.!?;:]|\b[a-z']+\b/ig);

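Explanation

  • The character class [,.!?;:] matches one of the characters inside the brackets
  • OR (the alternation |)
  • \b matches a word boundary
  • [a-z']+ matches one or more letters or apostrophes (the i flag makes it case-insensitive)
  • \b matches a word boundary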
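As a quick check, the pattern can be run against the string from the question. This is only a sketch: the variable name sentence stands in for yourString, and the || [] fallback is an addition to cover the case where match() finds nothing and returns null.

```javascript
// `sentence` is a stand-in name for the question's test string.
const sentence =
  "hello how are you all doing, I hope that it's good! and fine. Looking forward to see you.";

// Same pattern as above; `|| []` guards against match() returning null.
const resultArray = sentence.match(/[,.!?;:]|\b[a-z']+\b/gi) || [];

console.log(resultArray.length); // 22
console.log(resultArray.join(" "));
// hello how are you all doing , I hope that it's good ! and fine . Looking forward to see you .
```

Note that digits appear in neither alternative, so a token such as 42 would be silently dropped; extend the letter class (for example to [a-z0-9']) if the input can contain numbers.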