使用引文解析文本文件

时间:2017-04-24 19:12:10

标签: php regex parsing nlp

问题是我正在尝试使用php通过句子来分割文本文件。我目前使用以下功能:

$results = preg_split('/(?<=[.?!])\s+/', $stringtest, -1, PREG_SPLIT_NO_EMPTY);

问题在于这些句子:

In his book The Symposium, Plato wrote “Those who are halves of a man whole pursue males, and being slices, so to speak, of the male, love men throughout their boyhood, and take pleasure in physical contact with men” (qtd. in Isay 11).

它像这样拆分:

[0] In his book The Symposium, Plato wrote “Those who are halves of a man whole pursue males, and being slices, so to speak, of the male, love men throughout their boyhood, and take pleasure in physical contact with men” (qtd. 
[1] in Isay 11).

另一个例子是:

Dr. Evelyn Hooker, a heterosexual psychologist...

Dr.部分将成为一个问题。

这些文本全部来自NLP的MASC语料库。

1 个答案:

答案 0 :(得分:1)

您可以延长@ndn's solution以达到您的需要。请注意,$before_regexes包含已知缩写的列表,添加您的语料库中存在的缩写。我在那里添加了qtd

然后,请注意$before_regexes$after_regexes配对。我添加了'/(?:[”’"\'»])\s*\Z/u' / '/\A(?:\(\p{L})/u'对并将其标记为非句子边界(false数组中的第一个$is_sentence_boundary。正则表达对意味着:找到引号(”’"'»),0 +空格,然后是((带\()和任何Unicode字母(\p{L}),那么就不应该拆分。

function sentence_split($text) {
    $before_regexes = array('/(?:[”’"\'»])\s*\Z/u',
        '/(?:(?:[\'\"„][\.!?…][\'\"”]\s)|(?:[^\.]\s[A-Z]\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s)|(?:\b(?:St|Gen|Hon|Prof|Dr|Mr|Ms|Mrs|[JS]r|Col|Maj|Brig|Sgt|Capt|Cmnd|Sen|Rev|Rep|Revd)\.\s[A-Z]\.\s)|(?:\bApr\.\s)|(?:\bAug\.\s)|(?:\bBros\.\s)|(?:\bCo\.\s)|(?:\bCorp\.\s)|(?:\bDec\.\s)|(?:\bDist\.\s)|(?:\bFeb\.\s)|(?:\bInc\.\s)|(?:\bJan\.\s)|(?:\bJul\.\s)|(?:\bJun\.\s)|(?:\bMar\.\s)|(?:\bNov\.\s)|(?:\bOct\.\s)|(?:\bPh\.?D\.\s)|(?:\bSept?\.\s)|(?:\b\p{Lu}\.\p{Lu}\.\s)|(?:\b\p{Lu}\.\s\p{Lu}\.\s)|(?:\bcf\.\s)|(?:\be\.g\.\s)|(?:\besp\.\s)|(?:\bet\b\s\bal\.\s)|(?:\bvs\.\s)|(?:\p{Ps}[!?]+\p{Pe} ))\Z/su',
        '/(?:(?:[\.\s]\p{L}{1,2}\.\s))\Z/su',
        '/(?:(?:[\[\(]*\.\.\.[\]\)]* ))\Z/su',
        '/(?:(?:\b(?:pp|[Vv]iz|i\.?\s*e|[Vvol]|[Rr]col|maj|Lt|[Ff]ig|[Ff]igs|[Vv]iz|[Vv]ols|[Aa]pprox|[Ii]ncl|Pres|[Dd]ept|min|max|[Gg]ovt|lb|ft|c\.?\s*f|vs|qtd)\.\s))\Z/su',
        '/(?:(?:\b[Ee]tc\.\s))\Z/su',
        '/(?:(?:[\.!?…]+\p{Pe} )|(?:[\[\(]*…[\]\)]* ))\Z/su',
        '/(?:(?:\b\p{L}\.))\Z/su',
        '/(?:(?:\b\p{L}\.\s))\Z/su',
        '/(?:(?:\b[Ff]igs?\.\s)|(?:\b[nN]o\.\s))\Z/su',
        '/(?:(?:[\"”\']\s*))\Z/su',
        '/(?:(?:[\.!?…][\x{00BB}\x{2019}\x{201D}\x{203A}\"\'\p{Pe}\x{0002}]*\s)|(?:\r?\n))\Z/su',
        '/(?:(?:[\.!?…][\'\"\x{00BB}\x{2019}\x{201D}\x{203A}\p{Pe}\x{0002}]*))\Z/su',
        '/(?:(?:\s\p{L}[\.!?…]\s))\Z/su');
    $after_regexes = array('/\A(?:\(\p{L})/u',
        '/\A(?:)/su',
        '/\A(?:[\p{N}\p{Ll}])/su',
        '/\A(?:[^\p{Lu}])/su',
        '/\A(?:[^\p{Lu}]|I)/su',
        '/\A(?:[^p{Lu}])/su',
        '/\A(?:\p{Ll})/su',
        '/\A(?:\p{L}\.)/su',
        '/\A(?:\p{L}\.\s)/su',
        '/\A(?:\p{N})/su',
        '/\A(?:\s*\p{Ll})/su',
        '/\A(?:)/su',
        '/\A(?:\p{Lu}[^\p{Lu}])/su',
        '/\A(?:\p{Lu}\p{Ll})/su');
    $is_sentence_boundary = array(false, false, false, false, false, false, false, false, false, false, false, true, true, true);
    $count = 13;

    $sentences = array();
    $sentence = '';
    $before = '';
    $after = substr($text, 0, 10);
    $text = substr($text, 10);

    while($text != '') {
        for($i = 0; $i < $count; $i++) {
            if(preg_match($before_regexes[$i], $before) && preg_match($after_regexes[$i], $after)) {
                if($is_sentence_boundary[$i]) {
                    array_push($sentences, $sentence);
                    $sentence = '';
                }
                break;
            }
        }

        $first_from_text = $text[0];
        $text = substr($text, 1);
        $first_from_after = $after[0];
        $after = substr($after, 1);
        $before .= $first_from_after;
        $sentence .= $first_from_after;
        $after .= $first_from_text;
    }

    if($sentence != '' && $after != '') {
        array_push($sentences, $sentence.$after);
    }

    return $sentences;
}

请参阅PHP demo