RegEx to exclude academic titles

Date: 2013-03-09 04:21:56

Tags: php regex text

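I want to split a paragraph string into an array of sentences. Naturally, I use a regular expression on the dot character (.) to break the paragraph into sentences. The problem is the academic title abbreviations inside the sentences, each of which ends with a dot (.). As a result, my regex splits the paragraph in the wrong places.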

Here is a sample paragraph:

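  Meanwhile Rector of Bogor Agricultural University, Prof. Dr. Herry Suhardiyanto, in his remarks requested that the graduate students should keep on studying and will finalize their studies on time. Present in that general audience were the Deputy Dean of the Graduate School of Bogor Agricultural University, Dr.Dedi Jusadi, Secretary of the Graduate School for Doctoral Program of Bogor Agricultural University, Prof.Dr. Marimin.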

Using only the dot (.) as the regex, I get:

Array (
[0] => Meanwhile Rector of Bogor Agricultural University, Prof
[1] => Dr
[2] => Herry Suhardiyanto, in his remarks requested that the graduate students should keep on studying and will finalize their studies on time
[3] => ...
)
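
For reference, the unwanted result above can be reproduced with a minimal sketch like the following. It assumes the split was done with something like `preg_split('/\./', ...)` followed by trimming; the question does not show the exact call, and the paragraph is shortened here.

```php
<?php
// Shortened version of the sample paragraph from the question.
$paragraph = "Meanwhile Rector of Bogor Agricultural University, Prof. Dr. Herry Suhardiyanto, "
           . "in his remarks requested that the graduate students should keep on studying. "
           . "Present in that general audience was Prof.Dr. Marimin.";

// Split on every literal dot. The dot is escaped because an unescaped "."
// matches any character in a regex.
$naive = preg_split('/\./', $paragraph);

// Remove the whitespace left at the start of each piece.
$naive = array_map('trim', $naive);

print_r($naive);
```

Every title abbreviation ends a piece, and the final dot leaves an empty string as the last element.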

This is what I actually want:

Array (
[0] => Meanwhile Rector of Bogor Agricultural University, Prof. Dr. Herry Suhardiyanto, in his remarks requested that the graduate students should keep on studying and will finalize their studies on time
[1] => Present in  that general audience were  the Deputy Dean of the Graduate School of Bogor Agricultural University, Dr.Dedi Jusadi, Secretary of the Graduate School for Doctoral Program of Bogor Agricultural University, Prof.Dr. Marimin
)

2 answers:

Answer 0: (score: 3)

You can use negative lookbehinds:

((?<!Prof)(?<!Dr)(?<!Mr)(?<!Mrs)(?<!Ms))\.

Add more titles as needed.

Demo with explanation: http://regex101.com/r/xQ3xF9

The code would look like this:

$text="Meanwhile Rector of Bogor Agricultural University, Prof. Dr. Herry Suhardiyanto, in his remarks about Mr. John requested that the graduate students should keep on studying and will finalize their studies on time. Present in that general audience were Mrs. Peterson of the Graduate School of Bogor Agricultural University, Dr.Dedi Jusadi, Secretary of the Graduate School for Doctoral Program of Bogor Agricultural University, Prof.Dr. Marimin.";

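// One negative lookbehind per title; a dot preceded by any of them is not a split point.
$titles=array('(?<!Prof)', '(?<!Dr)', '(?<!Mr)', '(?<!Mrs)', '(?<!Ms)');
// Concatenate the lookbehinds in front of the escaped dot and split on the result.
$sentences=preg_split('/('.implode('',$titles).')\./',$text);
print_r($sentences);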
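
As written, the split leaves a leading space on every sentence after the first and an empty string after the final dot. A possible refinement, sketched under a few assumptions: `splitSentences` is a hypothetical helper name, the lookbehinds are built from a plain list of titles, and `\b` is added so that a word such as "ATMs" is not mistaken for the title "Ms".

```php
<?php
// Hypothetical helper (not from the answer): split $text into sentences,
// ignoring dots that directly follow one of the listed title abbreviations.
function splitSentences($text, array $titles)
{
    $lookbehinds = '';
    foreach ($titles as $title) {
        // One fixed-width negative lookbehind per title. \b keeps "Ms" from
        // matching the tail of a longer word such as "ATMs".
        $lookbehinds .= '(?<!\b' . preg_quote($title, '/') . ')';
    }
    // \.\s* consumes the dot plus any following whitespace;
    // PREG_SPLIT_NO_EMPTY drops the empty piece after the final dot.
    return preg_split('/' . $lookbehinds . '\.\s*/', $text, -1, PREG_SPLIT_NO_EMPTY);
}

$text = "Prof. Dr. Herry Suhardiyanto spoke about Mr. John. "
      . "Also present were Mrs. Peterson, Dr.Dedi Jusadi and Prof.Dr. Marimin.";

$sentences = splitSentences($text, array('Prof', 'Dr', 'Mr', 'Mrs', 'Ms'));
print_r($sentences);
```

This still splits on any other dot, so decimals such as "3.14" or abbreviations that are not in the list (for example "e.g.") would need their own handling.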

Answer 1: (score: 1)

This seems to work, although it relies on plain PHP functions rather than a strict RegEx:

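// Fragments as they might look after splitting on "." alone (contrived sample).
$begin = array( 0=>'Meanwhile in geography,',
            1=>'Dr',
            2=>'Henry Suhardiyanto, in his remarks, stated that ',
            3=>'Dr',
            4=>'Prof',
            5=>'Jedi Dusadi was another ',
            6=>'Prof');

// Title abbreviations whose trailing dot does not end a sentence.
$exclusions = array("Dr", "Prof", "Mr", "Mrs");

// foreach iterates over a snapshot, so $sentence is always the original fragment.
foreach ($begin as $pos => $sentence) {
    // A trailing title with no following fragment is left untouched.
    if (in_array($sentence, $exclusions) && isset($begin[$pos+1])) {
        // Use the live $begin[$pos] so chained titles ("Dr", "Prof") accumulate.
        $begin[$pos+1] = $begin[$pos] . ". " . $begin[$pos+1];
        unset($begin[$pos]);
    }
}
// Reindex once, after the loop; array_values() returns a new array.
$begin = array_values($begin);
print_r($begin);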
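
The sample array above is contrived: a real split on "." leaves the title at the end of a longer fragment (for example "... University, Prof"), as the output in the question shows. A minimal sketch of the same merge idea that checks only the last word of each fragment might look like this; `mergeTitleFragments` is a hypothetical name, not part of PHP or of the answer.

```php
<?php
// Hypothetical helper: re-join fragments that were split right after a title abbreviation.
function mergeTitleFragments(array $fragments, array $titles)
{
    $sentences = array();
    $pending = '';   // text accumulated so far for the current sentence
    foreach ($fragments as $fragment) {
        $pending .= $fragment;
        // Capture the trailing run of letters, i.e. the last word of the accumulated text.
        preg_match('/[A-Za-z]+$/', $pending, $m);
        if ($m && in_array($m[0], $titles, true)) {
            $pending .= '.';   // the dot belonged to a title: restore it and keep going
            continue;
        }
        if (trim($pending) !== '') {
            $sentences[] = trim($pending);
        }
        $pending = '';
    }
    // Keep whatever is left if the text ended on a title.
    if (trim($pending) !== '') {
        $sentences[] = trim($pending);
    }
    return $sentences;
}

$fragments = explode('.', "Prof. Dr. Herry spoke first. Then Dr.Dedi Jusadi answered.");
$merged = mergeTitleFragments($fragments, array('Dr', 'Prof', 'Mr', 'Mrs', 'Ms'));
print_r($merged);
```

Because the original spacing is kept inside each fragment, "Prof. Dr." and "Dr.Dedi" come back exactly as they were written.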