JavaScript regular expression to match a word within sentences

Time: 2016-02-14 15:33:00

Tags: javascript regex string parsing sentence

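What regular expression should I use in JavaScript to match a specific word in each sentence?

The rule for matching a sentence is clear: it ends with a period (.) and the next letter is uppercase.

But what I need is to match a word within each sentence. So I think I should use groups. Or should I put the word itself into the regular expression?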

This is the Java regex I use to loop over sentences: (link)

This is my Java regex for matching a word with a context of +5 words: (link). But I need to use both of them together in JavaScript.

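My goal:

Input:

Cliffs have collapsed in New Zealand during an earthquake in the city of Christchurch on the South Island. No serious damage or fatalities were reported in the Valentine's Day earthquake that struck at 13:13 local time. Based on med. report everyone is ok.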

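Output for the selected word "on":

  1. Cliffs have collapsed in New Zealand during an earthquake in the city of Christchurch on the South Island.
  2. Based on med. report everyone is ok.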

1 answer:

Answer 0 (score: 2)

Update: I am providing two solutions below. My original answer provided only the first.

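  1. One solution uses a single regular expression to try to parse the entire original paragraph. This can be done but, as discussed below, it is probably not the best solution.

  2. The other solution is a more complex algorithm, but uses lighter regular expressions. It splits the text into sentences and processes each sentence separately. This solution is much more efficient and, dare I say, more elegant.

Solution 1: A single regular expression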

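Run the first code snippet below to demonstrate this solution. It finds all sentences (as you defined them) that contain whatever keyword you want. The complete regular expression, written as the JavaScript string that builds it, is...

    "\\. +([A-Z]([^.]|\\.(?! +[A-Z]))*?" + keyword + "([^.]|\\.(?! +[A-Z]))*?\\.(?= +[A-Z]))"

...but the code breaks it down into parts that are easier to understand.

After clicking the "Run code snippet" button, it takes a few seconds to run.

This is a fairly regex-heavy solution, and it can be quite slow. With the sample paragraph you provided, this routine becomes unbearably slow. Even being that slow, it is still not sophisticated enough, because it cannot tell when the keyword is embedded in another word (for example, when looking for "cats", it also finds "catsup"). Avoiding such embedded matches is possible, but it makes the whole thing too slow to even demonstrate.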

    var text = "I like cats. I really like cats. I also like dogs. Dogs and cats are pets. Approx. half of pets are cats. Approx. half of pets are dogs. Some cats are v. expensive.";
    
    var keyword = "cats";
    
    var reStr =
      // note: every regex backslash is doubled because these are string literals
      "\\. +"                   + // a preceding sentence-ender, i.e. a period
                                  //   followed by one or more spaces
      "("                       + // begin remembering the match (i.e. arr[1] below)
        "[A-Z]"                 + // a sentence-starter, i.e. an uppercase letter
        "("                     + // start of a sentence-continuer, which is either
          "[^.]"                + // anything but a period
          "|"                   + // or
          "\\.(?! +[A-Z])"      + // a period not followed by one or more spaces
                                  //   and an uppercase letter
        ")"                     + // end of a sentence-continuer
        "*?"                    + // zero or more of the preceding sentence-continuers
                                  //   but as few as possible
        keyword                 + // the keyword being sought
        "([^.]|\\.(?! +[A-Z]))" + // a sentence-continuer, as described above
        "*?"                    + // zero or more of them but as few as possible
        "\\."                   + // a sentence-ender, i.e. a period
        "(?= +[A-Z])"           + // followed by one or more spaces and an
                                  //   uppercase letter, which is not remembered
      ")";                        // finish remembering the match
    
    // That ends up being the following:
    // "\\. +([A-Z]([^.]|\\.(?! +[A-Z]))*?" + keyword + "([^.]|\\.(?! +[A-Z]))*?\\.(?= +[A-Z]))"
    
    
    var re = new RegExp(reStr, "g"); // construct the regular expression
    
    var sentencesWithKeyword = []; // initialize an array to keep the hits
    var arr; // prepare an array to temporarily keep 'exec' return values
    var expandedText = ". " + text + " A";
    // add a sentence-ender (i.e. a period) before the text
    //   and a sentence-starter (i.e. an uppercase letter) after the text
    //   to facilitate finding the first and last sentences
    
    while ((arr = re.exec(expandedText)) !== null) { // while hits are found
      sentencesWithKeyword.push(arr[1]); // remember the sentence found
      re.lastIndex -= 2; // start the next search two characters back
                         //   to allow for starting the next match
                         //   with the period that ended the current match
    }
    
    // show the results
    show("Text to search:");
    show(text);
    show("Query string: " + keyword);
    show("Hits:");
    for (var num = 0; num < sentencesWithKeyword.length; num += 1) {
      show((num + 1) + ". " + sentencesWithKeyword[num]);
    }
    
    function show(msg) {
      document.write("<p>" + msg + "</p>");
    }
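Two details of Solution 1 can be checked in isolation. The sketch below is illustrative only and is not part of the original answer; the variable names `single`, `doubled`, `loose` and `wholeWord` are hypothetical. It shows that a regex backslash must be doubled inside a string literal passed to `new RegExp`, and that a `\b` word boundary is one way to keep "cats" from matching inside "catsup", assuming the keyword starts and ends with word characters.

```javascript
// In a string literal, "\." is the same as ".", so the regex sees an
// unescaped dot (any character). Doubling the backslash keeps the escape.
var single  = new RegExp("\. +[A-Z]");   // pattern source: . +[A-Z]
var doubled = new RegExp("\\. +[A-Z]");  // pattern source: \. +[A-Z]

console.log(single.source);          // . +[A-Z]
console.log(doubled.source);         // \. +[A-Z]
console.log(single.test("ab Cd"));   // true:  "b C" matches because . is any character
console.log(doubled.test("ab Cd"));  // false: there is no literal period

// A bare keyword also matches inside longer words; \b restricts it to whole words.
var loose     = new RegExp("cats", "i");
var wholeWord = new RegExp("\\bcats\\b", "i");

console.log(loose.test("Catsup is tasty."));      // true  (embedded match)
console.log(wholeWord.test("Catsup is tasty."));  // false
console.log(wholeWord.test("I like cats."));      // true
```

Adding `\b` around the keyword inside the large single-regex pattern is possible, but as noted above, every extra constraint makes that pattern slower.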

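Solution 2: Divide and conquer

Here, you do the following:

• Split the original text into an array of sentences
• Search each sentence for the keyword
• Keep the sentences that contain the keyword and discard those that do not

This way, no regular expression you use has to handle splitting into sentences, searching for the keyword, keeping the hits and discarding the misses all at once in one massive pattern.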

    var textToSearch = "I like cats. I really like cats. I also like dogs. Cats are great.  Catsup is tasty. Dogs and cats are pets. Approx. half of pets are cats. Approx. half of pets are dogs. Some cats are v. expensive.";
    
    var keyword = "cats";
    
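    // buckets for every sentence, the hits and the misses
    var sentences = {
      all           : [],
      withKeyword   : [],
      withNoKeyword : []
    };
    
    // a sentence boundary: a period, one or more spaces, then an uppercase letter
    var sentenceRegex = new RegExp("([.]) +([A-Z])", "g");
    var sentenceSeparator = "__SENTENCE SEPARATOR__";
    // replace the spaces at each boundary with a marker, then split on the marker
    var modifiedText = textToSearch.replace(sentenceRegex, "$1" + sentenceSeparator + "$2");
    sentences.all = modifiedText.split(sentenceSeparator);
    
    // the keyword as a whole word: preceded by the start of the sentence or spaces
    //   and followed by spaces or a period (case-insensitive)
    var keywordRegex = new RegExp("(^| +)" + keyword + "( +|[.])", "i");
    
    sentences.all.forEach(function(sentence) {
      if (keywordRegex.test(sentence)) {
        sentences.withKeyword.push(sentence);
      } else {
        sentences.withNoKeyword.push(sentence);
      }
    });
    
    // show the three arrays
    document.write("<pre>" + JSON.stringify(sentences, null, 2) + "</pre>");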
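The same divide-and-conquer steps can also be wrapped in a reusable function. What follows is a minimal sketch under stated assumptions, not the answer's own code: the names `splitIntoSentences`, `escapeRegExp` and `sentencesContainingWord` are hypothetical, the keyword test uses `\b` word boundaries instead of the `(^| +)...( +|[.])` pattern above (which assumes the keyword starts and ends with a word character), and the sample paragraph is modeled on the one in the question.

```javascript
// Split on a period followed by spaces and an uppercase letter,
// keeping the period with the sentence it ends.
function splitIntoSentences(text) {
  var boundary = /\. +(?=[A-Z])/g;
  var sentences = [];
  var start = 0;
  var match;
  while ((match = boundary.exec(text)) !== null) {
    sentences.push(text.slice(start, match.index + 1)); // include the period
    start = boundary.lastIndex;                         // skip the spaces
  }
  sentences.push(text.slice(start)); // the last sentence has no boundary after it
  return sentences;
}

// Escape characters that have a special meaning in a regular expression,
// so that the keyword is matched literally.
function escapeRegExp(str) {
  return str.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Keep only the sentences that contain the word as a whole word.
function sentencesContainingWord(text, word) {
  var wordRegex = new RegExp("\\b" + escapeRegExp(word) + "\\b", "i");
  return splitIntoSentences(text).filter(function (sentence) {
    return wordRegex.test(sentence);
  });
}

var paragraph =
  "Cliffs have collapsed in New Zealand during an earthquake in the city of " +
  "Christchurch on the South Island. No serious damage or fatalities were " +
  "reported in the Valentine's Day earthquake that struck at 13:13 local time. " +
  "Based on med. report everyone is ok.";

var hits = sentencesContainingWord(paragraph, "on");
hits.forEach(function (sentence, i) {
  console.log((i + 1) + ". " + sentence);
});
```

Because "med." is followed by a lowercase letter, it is not treated as a sentence end, so "Based on med. report everyone is ok." stays in one piece. Abbreviations followed by a capitalized word would still be split; the sentence rule from the question does not cover that case.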