JavaScript regexp to identify the different parts of a sentence

Date: 2015-08-19 22:06:23

Tags: javascript regex

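I have a very specific requirement. Consider the sentence "I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray".

I am interested in a regular expression that recognizes "I", "am", "a", "robot", "X-rrt", ",", "I", "am", "35", "and", "my", "creator", "is", "5-MAF", ".", "Everything", "here", "is", "5", "times", "than", "my", "world5", "-", "hurray".

That is:
1) it should recognize every punctuation mark as its own token, except "-" when it is part of a word;
2) numbers should be recognized separately, unless they are part of a word that contains letters.

I am quite confused by this. Any suggestions would be much appreciated!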

5 answers:

Answer 0: (score: 1)

Try splitting on each run of whitespace, and just before periods and commas:

str.split(/\s+|(?=[.,])/);
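As a quick check, the split can be run against the sentence from the question. This is only a sketch: the names `sentence` and `tokens` are illustrative, and the pattern treats only `.` and `,` as punctuation, so other marks such as `!` or `?` would stay attached to the preceding word.

```javascript
// Sentence taken from the question.
const sentence = "I am a robot X-rrt, I am 35 and my creator is 5-MAF. " +
  "Everything here is 5 times than my world5 - hurray";

// Split on runs of whitespace, or at the empty position just before "." or ",".
const tokens = sentence.split(/\s+|(?=[.,])/);

console.log(tokens);
```

The lookahead `(?=[.,])` matches a position rather than a character, so the period or comma itself is kept and becomes the start of the next token.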

Answer 1: (score: 0)

Try this matching regexp:

str.match(/[\w\d-]+|\.|,/g);
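Assuming the intended pattern is `/[\w\d-]+|\.|,/g` (word characters, digits, and hyphens in one character class, with `.` and `,` as separate alternatives), a minimal sketch of how it behaves on the question's sentence follows. The names `sentence` and `tokens` are illustrative.

```javascript
// Sentence taken from the question.
const sentence = "I am a robot X-rrt, I am 35 and my creator is 5-MAF. " +
  "Everything here is 5 times than my world5 - hurray";

// Assumed pattern: runs of word characters and hyphens, or a single "." or ",".
// A lone hyphen also matches the character class, so "-" becomes its own token.
const tokens = sentence.match(/[\w\d-]+|\.|,/g);

console.log(tokens);
```

Two caveats: `\d` is already covered by `\w`, so it is redundant in the class, and because the hyphen is part of the class, a hyphen glued to one side of a word (as in `-foo`) would stay attached to it.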

Answer 2: (score: 0)

This is not that easy. I suggest doing some preprocessing on the text before splitting it, for example:



var text = "I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray";
var preprocessedText = text.replace(/(\w|^)(\W)( |$)/g, "$1 $2$3");
var tokens = preprocessedText.split(" ");
alert(tokens.join("\n"));
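The same preprocess-then-split idea can be wrapped in a small function that also runs outside the browser. This is a sketch: `tokenize` is a hypothetical name, and the final `filter` is an addition not present in the answer's code, included because repeated spaces or a leading punctuation mark would otherwise produce empty tokens.

```javascript
// Hypothetical helper wrapping the preprocess-then-split approach.
function tokenize(text) {
  // Insert a space before a non-word character that directly follows a word
  // character (or the start of the text) and is followed by a space or the end.
  const preprocessed = text.replace(/(\w|^)(\W)( |$)/g, "$1 $2$3");
  // Split on single spaces and drop empty strings (assumption: empty tokens are unwanted).
  return preprocessed.split(" ").filter((token) => token !== "");
}

console.log(tokenize("I am 35, and my creator is 5-MAF."));
```

A hyphen inside a word such as `5-MAF` is left alone because it is followed by a letter, not by a space or the end of the text.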




Answer 3: (score: 0)

I tested this with Perl. It should not be too hard to translate to JavaScript.

use feature 'say';  # enables say(), used below

my $sentence = 'I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray';

my @words = split(/\s|(?<!-)\b(?!-)/, $sentence);

say "'" . join ("', '", @words) . "'";
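One possible JavaScript translation is sketched below. It assumes an engine with lookbehind support (`(?<!...)`, added in ES2018), and the names `sentence` and `words` are illustrative. The `filter` step is an addition that guards against empty fields, for example when the input contains consecutive spaces.

```javascript
// Sentence taken from the question.
const sentence = "I am a robot X-rrt, I am 35 and my creator is 5-MAF. " +
  "Everything here is 5 times than my world5 - hurray";

// Split on a whitespace character, or at a word boundary
// that has no hyphen immediately before or after it.
const words = sentence
  .split(/\s|(?<!-)\b(?!-)/)
  .filter((word) => word !== ""); // drop any empty fields

console.log(words);
```

The two lookarounds are what keep `X-rrt` and `5-MAF` together: the boundaries on either side of the hyphen are rejected, so no split happens there.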

Answer 4: (score: 0)

Here is a solution that meets your requirements:

/(?:\w|\b-\b)+|[^\w\s]+/g

See the regex demo.

Details:

  • (?:\w|\b-\b)+ - 1 or more occurrences of:
    • \w - a word character
    • | - or
    • \b-\b - a hyphen between word characters
  • | - or
  • [^\w\s]+ - 1 or more characters other than word and whitespace characters.
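The breakdown above can be checked on a few small inputs. This is a sketch with an illustrative variable name (`tokenPattern`); the inputs are fragments of the question's sentence.

```javascript
const tokenPattern = /(?:\w|\b-\b)+|[^\w\s]+/g;

// A hyphen with word characters on both sides stays inside the token.
const joined = "X-rrt".match(tokenPattern);

// A free-standing hyphen fails \b-\b and falls through to [^\w\s]+.
const standalone = "world5 - hurray".match(tokenPattern);

// Trailing punctuation is split off from the hyphenated word.
const withPeriod = "5-MAF.".match(tokenPattern);

console.log(joined, standalone, withPeriod);
```

Because the second alternative uses `+`, consecutive punctuation marks such as `?!` are grouped into a single token; note also that `\w` includes the underscore.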

See the JS demo below:

var s = "I am a robot X-rrt, I am 35 and my creator is 5-MAF. Everything here is 5 times than my world5 - hurray";
console.log(s.match(/(?:\w|\b-\b)+|[^\w\s]+/g));