Question

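My client came up with this tricky problem, and I can't find the answer, so now I'll try asking you.

The task is as follows:

I think one rule might be: dots that appear immediately after a number are not counted as sentence ends. This means that the dots in "8. marts" and "2.567" are not counted. In return, a real sentence-ending dot may occasionally be overlooked (if a sentence ends with a number: "Vi kommer kl. 8"), but that probably doesn't happen very often.

Another might be: if a character (a letter or digit) comes immediately after a dot, the dot is not counted as a sentence end. That way we avoid counting the dots inside "f.eks.", "bl.a." and "cand.mag.".

I hope I can get some help here.

My code: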

<script>
function word_count(field, count) {

    var wordsNumberOverSeven = 0;
    var wordsNumber = 0;

    var contentText = $('#lix_word_count').val();
    // Use global regexes: replace() with a string pattern only changes the first occurrence.
    contentText = contentText.replace(/[?!]/g, '.');  // treat ? and ! as sentence ends
    contentText = contentText.replace(/[,;:]/g, '');  // drop other punctuation
    contentText = contentText.replace(/\n/g, ' ').replace(/^\s+|\s+$/g, '').replace(/\s\s+/g, ' ');

    var matchDots = contentText.split('.').length - 1;
    var match = contentText.split(' ');

    $.each(match, function(){
        if ( this.length > 0 )
            wordsNumber += 1;

        // LIX counts words of more than six letters as long words.
        if ( this.length >= 7 )
        {
            wordsNumberOverSeven += 1;
        }

    });

    // Nothing to measure in an empty field; this also avoids dividing by zero below.
    if ( wordsNumber === 0 )
        return;

    // A text without any dot is treated as one sentence to avoid dividing by zero.
    var lixMatWords = wordsNumber / Math.max(matchDots, 1);
    var lixMatLongWords = ( wordsNumberOverSeven * 100 ) / wordsNumber;

    var lixMatch = Math.round(( lixMatWords + lixMatLongWords ) * 100) / 100;
    var lixType = '';

    if ( lixMatch <= 24 )
        lixType = 'Lixen i din tekst er ' + lixMatch + ', dvs. at teksten er meget let at læse.';
    else if ( lixMatch <= 34 )
        lixType = 'Lixen i din tekst er ' + lixMatch + ', dvs. at teksten er let at læse.';
    else if ( lixMatch <= 44 )
        lixType = 'Lixen i din tekst er ' + lixMatch + ', dvs. at teksten ligger i midterområdet.';
    else if ( lixMatch <= 54 )
        lixType = 'Lixen i din tekst er ' + lixMatch + ', dvs. at teksten er svær at læse.';
    else
        lixType = 'Lixen i din tekst er ' + lixMatch + ', dvs. at teksten er meget svær at læse.';

    /** alert(lixType + '\nDots: ' + matchDots + '\nWords: ' + wordsNumber + '\nLangeord: ' + wordsNumberOverSeven); **/
    alert(lixType);
}
</script>

Answer 1

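I think we need to see the rest of the rules, or at least a few more of them.

It might be better to describe which sentences you want to include rather than what to exclude. If you are looking for complete sentences, the rule might be a period preceded by a non-space character and followed by a space, carriage return, or newline, or it might be some more complicated set of rules. It will probably take several regular expressions and some additional logic to sort out the more complex cases.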
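As a rough illustration of that "describe what to include" idea, here is a minimal sketch. The function name `countSentenceEnds` is made up for this example, and treating the end of the text like whitespace is my own assumption. It counts a terminator only when it follows a non-space character and is followed by whitespace or the end of the text, so dots inside "2.567" or "cand.mag" are skipped, while an abbreviation dot followed by a space is still counted.

```javascript
// Hypothetical helper, not part of the asker's code.
function countSentenceEnds(text) {
  // [^\s.!?]   a non-space, non-terminator character right before the terminator
  // [.!?]+     one or more terminators, so "..." or "?!" counts once
  // (?=\s|$)   must be followed by whitespace or the end of the text (assumption)
  var matches = text.match(/[^\s.!?][.!?]+(?=\s|$)/g);
  return matches ? matches.length : 0;
}

console.log(countSentenceEnds("Han er cand.mag. Hun er læge.")); // 2
console.log(countSentenceEnds("Prisen er 2.567 kr"));            // 0
```

Note that "cand.mag." followed by a space still counts as a sentence end here; separating abbreviations from real sentence ends would need extra logic, such as a list of known abbreviations.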

Answer 2

If you want to split sentences according to that rule, then something like

mySentences.match(/(?:[^.0-9]|[0-9]+\.?|\.[a-z0-9])+(?:\.|$)/ig)

should do it.

You will have to expand a-z to include the accented characters used in your language, but otherwise that is all it takes.
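For Danish text, expanding the character class might look like the sketch below. The choice of æ, ø and å as the extra letters is an assumption on my part, and `SENTENCE_RE` and `splitSentences` are names invented for the example. The sample string only serves to put a non-ASCII letter directly after a dot.

```javascript
// Same pattern as above, with æ, ø and å added to the "dot followed by a
// character" alternative (assumed to be the letters Danish needs).
var SENTENCE_RE = /(?:[^.0-9]|[0-9]+\.?|\.[a-z0-9æøå])+(?:\.|$)/ig;

// Hypothetical wrapper that always returns an array.
function splitSentences(text) {
  return text.match(SENTENCE_RE) || [];
}

console.log(splitSentences("Mødet er d.å. flyttet."));
// [ "Mødet er d.å.", " flyttet." ]
```

With the plain a-z class, the same input splits into three pieces ("Mødet er d.", "å.", " flyttet."), because the dot before "å" is no longer recognized as being followed by a letter.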

It produces the following for your input text.

["I think that one rule might be: Dots which appears immediately after a number, not counted as sentences.",
 " This means that sentence present in the \"8. marts\"and \"2.567\" is not counted as word dots.",
 " In return, each word dots may be overlooked (if now a sentence ends with a number: \"Vi kommer kl.",
 " 8\") but it's probably after all not quite as often.",
 "\n\nAnother might be: If there is one character (a letter or number) immediately after a sentence is not a phrase sentence.",
 " That would make that we avoided counting the sentence present in the \"f.eks.",
 "\", \"bl.a.","\" and \"cand.mag.",
 "\"."]

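Clearly there is a problem with dots that appear inside quoted sections. You can fix that by walking the array and rejoining the pieces wherever a sentence ends inside a quoted section.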

// Given mySentences defined above, walk counting quote characters.
// You could modify the regexp below if your language tends to use
// a different quoting style, e.g. French-style angle quotes.
for (var i = 0; i < mySentences.length - 1; ++i) {
  var quotes = mySentences[i].match(/["\u201c\u201d]/g);
  // If there are an odd number of quotes, combine the next sentence
  // into this one.
  if (quotes && quotes.length % 2) {
    // In English, it is common to end the quoted section after the
    // closing punctuator: Say "hello."
    var next =  mySentences[i + 1];
    if (/^["\u201c\u201d]/.test(next)) {
      mySentences[i] += next.substring(0, 1);
      mySentences[i + 1] = next.substring(1);
    } else {
      mySentences[i] += next;
      // Remove the piece that was just merged in (index i + 1), not the merged one.
      mySentences.splice(i + 1, 1);
      --i;  // See if there's more to combine into this sentence.
    }
  }
}

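This kind of thing is brittle, though. If you want to know how people who specialize in this sort of problem handle it, search for "natural language segmentation".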
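As one example of a ready-made segmenter, newer JavaScript runtimes ship `Intl.Segmenter`, which can split text into sentences using locale data. The sketch below is only an assumption about how it could be plugged in here: `countSentencesWithSegmenter` is a made-up name, the default locale `'da'` is my guess at what the asker needs, and the boundaries it reports for abbreviations can differ between runtimes.

```javascript
// Hypothetical helper: counts sentences with the built-in Intl.Segmenter.
// Returns null when the runtime does not provide Intl.Segmenter.
function countSentencesWithSegmenter(text, locale) {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    return null;
  }
  var segmenter = new Intl.Segmenter(locale || 'da', { granularity: 'sentence' });
  var count = 0;
  for (var part of segmenter.segment(text)) {
    // Ignore segments that contain only whitespace.
    if (part.segment.trim().length > 0) {
      count += 1;
    }
  }
  return count;
}

console.log(countSentencesWithSegmenter("Vi kommer kl. 8. Prisen er 2.567 kr."));
```

The result could replace `matchDots` in the LIX calculation, with a fallback to the regex approach when the function returns null.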

JavaScript regex help
