Regex to match the content of sentences

Time: 2014-02-18 16:37:21

Tags: php regex

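I am adapting the top answer to php sentence boundaries detection.

Can anyone help me rework the regex above so that it matches the content between sentence boundaries rather than the boundaries themselves?

It was built for preg_split, and I need it for preg_replace_callback.

Here is my attempt so far. I can't get it to match the last sentence, because it relies on lookarounds to check for the boundaries:

http://regex101.com/r/nH7mC5 - this contains the sample output minus the last sentence.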

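1 Answer:

Answer 0: (score: 0)

I am the author of the cited sentence splitting answer. Here is a modified version that may suit your purposes:

Enhanced regex solution

Assuming you do care about handling abbreviations such as Mr. and Mrs., the following single-regex solution works pretty well: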

<?php // test.php Rev:20140218_1500
$re = '/# Match sentence ending in .!? followed by optional quote.
    (                  # $1: Sentence.
      [^.!?]+          # One or more non-end-of-sentence chars.
      (?:              # Zero or more not-end-of-sentence dots.
        \.             # Allow dot mid-sentence, but only if:
        (?:            # Group allowable dot alternatives.
          (?=[^\s\'"]) # Dot is ok if followed by non-ws,
        | (?<=         # or if preceded by one of the following:
            Mr\.       # Either "Mr."
          | Mrs\.      # or "Mrs.",
          | Ms\.       # or "Ms.",
          | Jr\.       # or "Jr.",
          | Dr\.       # or "Dr.",
          | Prof\.     # or "Prof.",
          | Sr\.       # or "Sr.",
          | T\.V\.A\.  # or "T.V.A.",
                       # or... (you get the idea).
          )            # End positive lookbehind.
        )              # Group allowable dot alternatives.
        [^.!?]*        # Zero or more non-end-of-sentence chars.
      )*               # Zero or more not-end-of-sentence dots.
      (?:              # Sentence end alternatives.
        [.!?]          # Either end of sentence punctuation
        [\'"]?         # followed by optional quote,
      | $              # Or end of string with no punctuation.
      )                # Sentence end alternatives.
    )                  # End $1: Sentence.
    (?:\s+|$)          # Sentence ends with ws or EOS.
    /ix';

$text = 'This is sentence one. Sentence two! Sentence thr'.
        'ee? Sentence "four". Sentence "five"! Sentence "'.
        'six"? Sentence "seven." Sentence \'eight!\' Dr. '.
        'Jones said: "Mrs. Smith you have a lovely daught'.
        'er!" The T.V.A. is a big project! Last sentence '.
        'with no ending punctuation';

$sentences = array(); // Initialize array of sentences.

function _getSentencesCallback($matches) {
    global $sentences;
    $sentences[] = $matches[1];
    return '';
}
preg_replace_callback($re, '_getSentencesCallback', $text);
for ($i = 0; $i < count($sentences); ++$i) {
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);
}
?>

Note that you can easily add or remove abbreviations from the expression. Given the following test paragraph:

  

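This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!" The T.V.A. is a big project! Last sentence with no ending punctuation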

Here is the output from the script:

Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project!]
Sentence[11] = [Last sentence with no ending punctuation]
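Since the abbreviations are just alternatives inside the lookbehind, the list could also be kept in a plain array and spliced into the pattern. The following is only a rough sketch under a few assumptions: the helper name buildSentenceRegex is hypothetical, the pattern is a compact copy of the one above without the x-mode comments, and "St." is an extra abbreviation added purely for illustration.

```php
<?php
// Hypothetical helper (not part of the original script): builds the
// sentence pattern from a list of abbreviations.
function buildSentenceRegex(array $abbreviations) {
    $alternatives = array();
    foreach ($abbreviations as $abbr) {
        // Escape the dots (and the "/" delimiter) so each entry is literal.
        $alternatives[] = preg_quote($abbr, '/');
    }
    // With an empty list, use (?!), which never matches, so that no
    // abbreviation exception applies at all.
    $lookbehind = $alternatives
        ? '(?<=' . implode('|', $alternatives) . ')'
        : '(?!)';
    // Same structure as the commented pattern, written on one line.
    return '/([^.!?]+(?:\.(?:(?=[^\s\'"])|' . $lookbehind . ')[^.!?]*)*'
         . '(?:[.!?][\'"]?|$))(?:\s+|$)/i';
}

// "St." is an assumption added for this example only.
$abbreviations = array('Mr.', 'Mrs.', 'Ms.', 'Jr.', 'Dr.', 'Prof.',
                       'Sr.', 'T.V.A.', 'St.');
$re = buildSentenceRegex($abbreviations);

preg_match_all($re, 'Dr. Jones lives on St. James Street. He is home!', $matches);
$sentences = $matches[1]; // Capture group 1 holds each sentence.

foreach ($sentences as $i => $sentence) {
    printf("Sentence[%d] = [%s]\n", $i + 1, $sentence);
}
```

With "St." in the list, the sample text should come out as two sentences. Without it, the dot after "St" would be treated as a sentence end. Each lookbehind alternative has a fixed length, which is what PCRE requires here.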

Hope this helps, and happy regexing!

Edit 2014-02-19 08:00: The last sentence at the end of the string no longer needs ending punctuation.
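If the goal is to rewrite each sentence in place rather than collect the sentences, the callback can return a modified sentence instead of an empty string. This is a minimal sketch, not a definitive implementation: the `<s>...</s>` markers are a made-up placeholder for whatever transformation is needed, the pattern is a compact copy of the one above, and the closure syntax assumes PHP 5.3 or later.

```php
<?php
// Compact copy of the pattern from the answer (no x-mode comments).
$re = '/([^.!?]+(?:\.(?:(?=[^\s\'"])|(?<=Mr\.|Mrs\.|Ms\.|Jr\.|Dr\.|Prof\.|Sr\.|T\.V\.A\.))[^.!?]*)*'
    . '(?:[.!?][\'"]?|$))(?:\s+|$)/i';

$text = 'Dr. Jones is here. He said hello! Last one with no punctuation';

// Wrap each sentence in hypothetical <s>...</s> markers and keep the
// whitespace that followed it, so the spacing of the text is preserved.
$marked = preg_replace_callback($re, function ($m) {
    // $m[0] is the sentence plus trailing whitespace, $m[1] the sentence.
    $trailing = substr($m[0], strlen($m[1]));
    return '<s>' . $m[1] . '</s>' . $trailing;
}, $text);

echo $marked, "\n";
```

The whole match includes the whitespace after the sentence, so the callback appends that remainder again. Returning only the rewritten sentence would glue neighbouring sentences together. The unpunctuated last sentence is wrapped as well, because the pattern accepts the end of the string as a sentence end.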