JS RegEx to split text into sentences

Time: 2014-12-24 01:42:07

Tags: javascript regex

I'm having a bit of trouble with regular expressions in JavaScript.

Here's my fiddle: http://jsfiddle.net/6yhwzap0/

The function I created is:

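var splitSentences = function(text) {
    // match() returns null when nothing matches, so fall back to an empty array.
    var messy = text.match(/\(?[^\.\?\!]+[\.!\?]\)?/g) || [];
    var clean = [];
    for(var i = 0; i < messy.length; i++) {
        var s = messy[i];
        var sTrimmed = s.trim();
        if(sTrimmed.length > 0) {
            if(sTrimmed.indexOf(' ') >= 0 || clean.length === 0) {
                // A fragment containing a space (or the very first fragment) is its own sentence.
                clean.push(sTrimmed);
            } else {
                // A fragment without spaces is glued onto the previous sentence.
                var d = clean[clean.length - 1];
                d = d + s;

                // If a next fragment exists and contains a space, it continues the same sentence.
                var e = messy[i + 1];
                if(e !== undefined && e.trim().indexOf(' ') >= 0) {
                    d = d + e;
                    i++;
                }
                clean[clean.length - 1] = d;
            }
        }
    }
    return clean;
};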

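I get very good results with text.match(/\(?[^\.\?\!]+[\.!\?]\)?/g). My one big problem is that if a string has a quotation mark after a period, the quotation mark is added to the next sentence.

For example, the following: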

"Hello friend. My name is Mud." Said Mud.

should be split into the following array:

['"Hello friend.', 'My name is Mud."', 'Said Mud.']

But instead it gives:

['"Hello friend.', 'My name is Mud.', '" Said Mud.']

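(Note the quotation mark at the start of the 'Said Mud.' string.)

Can anyone help me with this, or point me to a good JavaScript library that can split text into paragraphs, sentences, and words? I found blast.js, but I'm using Angular.js and it doesn't integrate well at all.

3 Answers:

Answer 0: (score: 3)

I suggest you use string.match instead of string.split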

\S.*?\."?(?=\s|$)

DEMO

> var s = '"Hello friend. My name is Mud." Said Mud.'
undefined
> s.match(/\S.*?\."?(?=\s|$)/g)
[ '"Hello friend.',
  'My name is Mud."',
  'Said Mud.' ]
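
The pattern above only treats a period as a sentence terminator. As a rough sketch of how it might be wrapped in a reusable function and extended to `?` and `!`, consider the following. The function name `splitOnTerminators` and the widened character classes are assumptions made for illustration; they are not part of the answer as posted.

```javascript
// Hypothetical helper: a variant of the pattern above that also accepts
// "?" and "!" as terminators, followed by optional closing quotes or parentheses.
function splitOnTerminators(text) {
  // \S        start at the first non-whitespace character
  // .*?       lazily consume characters (does not cross newlines)
  // [.!?]+    one or more terminator characters
  // ["')]*    optional closing quote or parenthesis
  // (?=\s|$)  must be followed by whitespace or the end of the input
  var pattern = /\S.*?[.!?]+["')]*(?=\s|$)/g;
  // match() returns null when nothing matches; normalize that to [].
  return text.match(pattern) || [];
}

console.log(splitOnTerminators('"Hello friend. My name is Mud." Said Mud.'));
// [ '"Hello friend.', 'My name is Mud."', 'Said Mud.' ]
```

Text that never reaches a terminator (for example a trailing fragment with no period) is silently dropped by this sketch, just as with the original pattern.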

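Answer 1: (score: 2)

Here is an example of how to solve your immediate problem. However, evaluating the characteristics of a sentence is clearly different from parsing text elements.

Regular expressions are best suited to deterministic algorithms. Sentences are mostly non-deterministic and require interpretation. For that type of use case, you need a natural language processing library.

Natural is an NLP library for Node.js and may be a good solution for your use case. However, I have not used it personally. YMMV.

Alchemy is another option, offering a full-featured NLP API as a REST web service.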

Full Page Demo

RegEx Tester


var text = "If he's restin', I'll wake him up! (Shouts at the cage.) 'Ello, Mister Polly Parrot! (Owner hits the cage.) There, he moved!!!\r\n\r\nNorth Korea is accusing the U.S. government of being behind the making of the movie \"The Interview.\"\r\n\r\nAnd, in a dispatch on state media, the totalitarian regime warns the United States that U.S. \"citadels\" will be attacked, dwarfing the attack on Sony that led to the cancellation of the film's release.\r\n\r\nWhile steadfastly denying involvement in the hack, North Korea accused U.S. President Barack Obama of calling for \"symmetric counteraction.\"\r\n\r\n\"The DPRK has already launched the toughest counteraction. Nothing is more serious miscalculation than guessing that just a single movie production company is the target of this counteraction. Our target is all the citadels of the U.S. imperialists who earned the bitterest grudge of all Koreans,\" a report on state-run KCNA read.";

var splitSentences = function() {

  var pattern = /(.+?([A-Z].)[\.|\?](?:['")\\\s]?)+?\s?)/igm, match;
  var ol = document.getElementById( "result" );
  while( ( match = pattern.exec( text )) != null ) {
    if( match.index === pattern.lastIndex ) {
      pattern.lastIndex++;
    }
    var li = document.createElement( "li" );
    li.appendChild( document.createTextNode( match[0] ) );
    ol.appendChild( li );
    console.log( match[0] );
  }

}();
<ol id="result">
</ol>


Expression

     /(.+?([A-Z].)[\.|\?](?:['")\\\s]?)+?\s?)/igm

    1st Capturing group (.+?([A-Z].)[\.|\?](?:['")\\\s]?)+?\s?)
        .+? matches any character (except newline)
        Quantifier: +? Between one and unlimited times, as few times as possible, expanding as needed [lazy]
    2nd Capturing group ([A-Z].)
        [A-Z] match a single character present in the list below
        A-Z a single character in the range between A and Z
        . matches any character (except newline)
        [\.|\?] match a single character present in the list below
        \. matches the character . literally
        | matches the character | literally (inside a character class it is not alternation)
        \? matches the character ? literally
    (?:['")\\\s]?)+? Non-capturing group
        Quantifier: +? Between one and unlimited times, as few times as possible, expanding as needed [lazy]
        ['")\\\s] match a single character present in the list below
        '") a single character in the list '") literally (case insensitive)
        \\ matches the character \ literally
        \s match any white space character [\r\n\t\f ]          
      \s? match any white space character [\r\n\t\f ]
        Quantifier: ? Between zero and one time, as many times as possible, giving back as needed [greedy]
i modifier: insensitive. Case insensitive match (ignores case of [a-zA-Z])
g modifier: global. All matches (don't return on first match)
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)  

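Answer 2: (score: 1)

Regexp is a very blunt instrument and is not the right way to do natural language processing, period. You need to find a library that does this, or write your own.

Besides the problem you found with quotation marks, you will of course have to handle abbreviations. In addition, if your application will be used with other languages, you will have to implement logic that separates sentences differently for each language. As I said, find a library.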
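
To make the abbreviation problem concrete: a naive splitter cuts after "Dr." or "U.S.". One common workaround is to split first and then re-join fragments that end in a known abbreviation. The sketch below assumes a tiny hard-coded list (`ABBREVIATIONS`) and hypothetical helper names; none of this comes from the answer, and a real list would be language-specific and far longer.

```javascript
// Hypothetical, deliberately tiny abbreviation list (an assumption for this sketch).
var ABBREVIATIONS = ['Mr.', 'Mrs.', 'Dr.', 'U.S.', 'e.g.', 'i.e.'];

// True when the fragment's last whitespace-separated token is a known abbreviation.
function endsWithAbbreviation(fragment) {
  var tokens = fragment.trim().split(/\s+/);
  var lastToken = tokens[tokens.length - 1];
  return ABBREVIATIONS.indexOf(lastToken) !== -1;
}

// Naive split on terminators, then merge fragments that were cut after an abbreviation.
function splitWithAbbreviations(text) {
  var fragments = text.match(/\S.*?[.!?]+["')]*(?=\s|$)/g) || [];
  var sentences = [];
  var pending = '';
  for (var i = 0; i < fragments.length; i++) {
    // Fragments are re-joined with a single space, so original spacing is normalized.
    pending = pending ? pending + ' ' + fragments[i] : fragments[i];
    if (!endsWithAbbreviation(pending)) {
      sentences.push(pending);
      pending = '';
    }
  }
  if (pending) {
    sentences.push(pending); // the text ended on an abbreviation
  }
  return sentences;
}

console.log(splitWithAbbreviations('Dr. Smith lives in the U.S. with Mrs. Jones. He is happy.'));
// [ 'Dr. Smith lives in the U.S. with Mrs. Jones.', 'He is happy.' ]
```

The sketch also shows the limits of this approach: a sentence that genuinely ends with "U.S." is merged into the one that follows, because the list cannot tell an abbreviation in the middle of a sentence from one at the end.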

You might be able to find a regexp that sort of works, and then run into yet another edge case, such as handling nested quotations:

  

"When Sally said 'Regexps are bad for NLP. Write a parser', I agreed," said Bob.

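Then you will spend the rest of your life fixing your giant regexp or, more likely, run into a brick wall where you simply cannot do what you want to do.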
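
As a small illustration of that kind of edge case, the snippet below runs the pattern from the first answer against an English rendering of the nested-quotation sentence. The exact English wording and the variable names are assumptions for illustration. The pattern cuts the single sentence in two at the period inside the inner quotation.

```javascript
// Pattern copied from the first answer on this page.
var pattern = /\S.*?\."?(?=\s|$)/g;

// English rendering of the nested-quotation example (wording assumed for illustration).
var nested = '"When Sally said \'Regexps are bad for NLP. Write a parser\', I agreed," said Bob.';

var pieces = nested.match(pattern) || [];

console.log(pieces.length); // 2, although the input is a single sentence
console.log(pieces[0]);     // "When Sally said 'Regexps are bad for NLP.
console.log(pieces[1]);     // Write a parser', I agreed," said Bob.
```

Teaching the pattern about the inner quotation would fix this input, but each such fix adds another special case, which is the maintenance burden this answer warns about.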