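I want to split text into sentences with PHP. I'm currently using a regular expression, which gives about 95% accuracy, and I'd like to improve on that with a better approach. I've seen NLP tools that do this in Perl, Java, and C, but nothing suitable for PHP. Do you know of such a tool?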
Answer 0 (score: 21)
Assuming you do care about handling abbreviations such as Mr. and Mrs., the following single-regex solution works quite well:
<?php // test.php Rev:20160820_1800
$split_sentences = '%(?#!php/i split_sentences Rev:20160820_1800)
# Split sentences on whitespace between them.
# See: http://stackoverflow.com/a/5844564/433790
(?<= # Sentence split location preceded by
[.!?] # either an end of sentence punct,
| [.!?][\'"] # or end of sentence punct and quote.
) # End positive lookbehind.
(?<! # But don\'t split after these:
Mr\. # Either "Mr."
| Mrs\. # Or "Mrs."
| Ms\. # Or "Ms."
| Jr\. # Or "Jr."
| Dr\. # Or "Dr."
| Prof\. # Or "Prof."
| Sr\. # Or "Sr."
| T\.V\.A\. # Or "T.V.A."
# Or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences,
(?=\S) # (but not at end of string).
%xi'; // End $split_sentences.
$text = 'This is sentence one. Sentence two! Sentence thr'.
'ee? Sentence "four". Sentence "five"! Sentence "'.
'six"? Sentence "seven." Sentence \'eight!\' Dr. '.
'Jones said: "Mrs. Smith you have a lovely daught'.
'er!" The T.V.A. is a big project! '; // Note ws at end.
$sentences = preg_split($split_sentences, $text, -1, PREG_SPLIT_NO_EMPTY);
for ($i = 0; $i < count($sentences); ++$i) {
printf("Sentence[%d] = [%s]\n", $i + 1, $sentences[$i]);
}
?>
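Note that you can easily add or remove abbreviations from the expression. Given the following test paragraph:
This is sentence one. Sentence two! Sentence three? Sentence "four". Sentence "five"! Sentence "six"? Sentence "seven." Sentence 'eight!' Dr. Jones said: "Mrs. Smith you have a lovely daughter!" The T.V.A. is a big project!
Here is the output of the script: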
Sentence[1] = [This is sentence one.]
Sentence[2] = [Sentence two!]
Sentence[3] = [Sentence three?]
Sentence[4] = [Sentence "four".]
Sentence[5] = [Sentence "five"!]
Sentence[6] = [Sentence "six"?]
Sentence[7] = [Sentence "seven."]
Sentence[8] = [Sentence 'eight!']
Sentence[9] = [Dr. Jones said: "Mrs. Smith you have a lovely daughter!"]
Sentence[10] = [The T.V.A. is a big project!]
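The question's author commented that the above solution "overlooks many options" and is not generic enough. I'm not sure what that means, but the essence of the above expression is about as clean and simple as it gets. Here it is: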
$re = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
$sentences = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);
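Note that both solutions correctly identify sentences that end with a quotation mark after the ending punctuation. If you don't care about matching sentences that end in a quotation mark, the regex can be simplified to: /(?<=[.!?])\s+(?=\S)/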
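To see what the quote-aware alternative buys you, here is a minimal sketch that runs both short patterns on the same input. The sample text and the variable names are made up for illustration and are not part of the original answer.

```php
<?php
// Illustrative input: the first sentence ends with a closing quote.
$sample = 'He said "stop." Then he left.';

// Quote-aware pattern from above: allows ' or " after the end punctuation.
$quote_aware = '/(?<=[.!?]|[.!?][\'"])\s+(?=\S)/';
// Simplified pattern: only splits directly after . ! or ?
$simple = '/(?<=[.!?])\s+(?=\S)/';

$with_quotes = preg_split($quote_aware, $sample, -1, PREG_SPLIT_NO_EMPTY);
$without_quotes = preg_split($simple, $sample, -1, PREG_SPLIT_NO_EMPTY);

// The quote-aware pattern splits after the closing quote; the simplified one
// sees a quote (not punctuation) before the whitespace and leaves the text whole.
printf(
    "quote-aware: %d piece(s), simplified: %d piece(s)\n",
    count($with_quotes),
    count($without_quotes)
);
```

The simplified pattern is only a safe choice if your text never puts a closing quote after the sentence-ending punctuation.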
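Edit: 20130820_1000 Added T.V.A. (another punctuated word to be ignored) to the regex and test string. (In answer to PapyRef's comment question.)
Edit: 20130820_1800 Tidied and renamed the regex and added a shebang. Also fixed the regex to prevent splitting text on trailing whitespace.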
Answer 1 (score: 2)
A slight improvement on someone else's work:
$re = '/# Split sentences on whitespace between them.
(?<= # Begin positive lookbehind.
[.!?] # Either an end of sentence punct,
| [.!?][\'"] # or end of sentence punct and quote.
) # End positive lookbehind.
(?<! # Begin negative lookbehind.
Mr\. # Skip either "Mr."
| Mrs\. # or "Mrs.",
| Ms\. # or "Ms.",
| Jr\. # or "Jr.",
| Dr\. # or "Dr.",
| Prof\. # or "Prof.",
| Sr\. # or "Sr.",
| \s[A-Z]\. # or initials ex: "George W. Bush",
# or... (you get the idea).
) # End negative lookbehind.
\s+ # Split on whitespace between sentences.
/ix';
$sentences = preg_split($re, $story, -1, PREG_SPLIT_NO_EMPTY);
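Answer 2 (score: 0)
As a low-tech approach, you might consider a series of explode calls in a loop, using ., !, and ? as your needles. This would be memory- and processor-intensive (as most text processing is). You would end up with a bunch of temporary arrays and one master array holding all the sentences found, numerically indexed in the correct order.
You would also have to check for common exceptions (such as the . in titles like Mr. and Dr.), but with everything in an array, those checks shouldn't be too bad.
I'm not sure this is any better than a regex in terms of speed and scaling, but it's worth a shot. How big are the blocks of text you want to break into sentences?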
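A minimal sketch of that idea, under a few assumptions: the function names `explode_sentences` and `merge_abbreviations` and the abbreviation list are hypothetical, and the exception check simply glues a piece that ends in a known abbreviation onto the piece that follows it.

```php
<?php
// Hypothetical helper: split on ., ! and ? using nothing but explode().
function explode_sentences(string $text): array
{
    $pieces = array($text);
    foreach (array('.', '!', '?') as $needle) {
        $next = array();
        foreach ($pieces as $piece) {
            $parts = explode($needle, $piece);
            $last = count($parts) - 1;
            foreach ($parts as $i => $part) {
                // explode() drops the needle, so put it back on every part but the last.
                $next[] = $i < $last ? $part . $needle : $part;
            }
        }
        $pieces = $next;
    }
    // Trim the pieces and drop the empty ones left over by trailing punctuation.
    $sentences = array();
    foreach ($pieces as $piece) {
        $piece = trim($piece);
        if ($piece !== '') {
            $sentences[] = $piece;
        }
    }
    return $sentences;
}

// Hypothetical exception check: re-join pieces that were cut after an abbreviation.
function merge_abbreviations(array $sentences, array $abbreviations): array
{
    $merged = array();
    $carry = '';
    foreach ($sentences as $sentence) {
        $candidate = $carry === '' ? $sentence : $carry . ' ' . $sentence;
        $words = explode(' ', $candidate);
        if (in_array(end($words), $abbreviations, true)) {
            $carry = $candidate; // ends in "Dr." etc., so keep collecting
        } else {
            $merged[] = $candidate;
            $carry = '';
        }
    }
    if ($carry !== '') {
        $merged[] = $carry;
    }
    return $merged;
}

$raw = explode_sentences('Dr. Jones left. Mrs. Smith stayed!');
$sentences = merge_abbreviations($raw, array('Mr.', 'Mrs.', 'Dr.'));
print_r($sentences);
```

Note that the merge step re-joins pieces with a single space, so the original whitespace between them is not preserved, and quotes after the punctuation are not handled at all.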
Answer 3 (score: 0)
I'm using this regex:
preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);
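It won't work for sentences that start with a number, but it should also produce very few false positives. Of course, what you are doing with the result matters too. My program now uses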
explode('.',$text);
because I decided that speed was more important than accuracy.
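To make the trade-off concrete, here is a small sketch that runs both calls on the same input. The sample text is made up for illustration; its second sentence starts with a digit, which is the case the regex is said to miss.

```php
<?php
// Illustrative input: the second sentence starts with a digit.
$text = 'It costs $5. 3 people came. Then they left.';

// Regex from this answer: split on one whitespace character that follows
// . ? or ! and precedes an uppercase letter or a quote.
$by_regex = preg_split('/(?<=[.?!])\s(?=[A-Z"\'])/', $text);

// Plain explode(): splits at every period and discards the period itself.
$by_explode = explode('.', $text);

print_r($by_regex);   // the first two sentences stay glued together
print_r($by_explode); // pieces keep their leading spaces; the last one is empty
```

With explode() you would still need to trim each piece and drop the empty one, and ! and ? are not treated as sentence ends.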
Answer 4 (score: 0)
Build a list of abbreviations like this
$skip_array = array (
'Jr', 'Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Sr', // etc.
);
Compile them into an expression
$skip = '';
foreach($skip_array as $abbr) {
$skip = $skip . (empty($skip) ? '' : '|') . '\s{1}' . $abbr . '[.!?]';
}
Finally, run this preg_split to break up the sentences.
$lines = preg_split ("/(?<!$skip)(?<=[.?!])\s+(?=[^a-z])/",
$txt, -1, PREG_SPLIT_NO_EMPTY);
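The same steps can be sketched as one function. This is a variant under stated assumptions, not the answer's exact code: the name `build_split_pattern` is hypothetical, it uses implode() instead of manual concatenation, it passes each abbreviation through preg_quote() so entries containing regex metacharacters are safe, and it only skips an abbreviation followed by a period (the loop above also allows ! and ?).

```php
<?php
// Hypothetical helper: turn a non-empty abbreviation list into a split pattern.
function build_split_pattern(array $abbreviations): string
{
    $alternatives = array();
    foreach ($abbreviations as $abbr) {
        // Whitespace, then the escaped abbreviation, then a literal period.
        $alternatives[] = '\s' . preg_quote($abbr, '/') . '\.';
    }
    // Negative lookbehind for the abbreviations, positive lookbehind for end
    // punctuation, then split on whitespace not followed by a lowercase letter.
    return '/(?<!' . implode('|', $alternatives) . ')(?<=[.?!])\s+(?=[^a-z])/';
}

$pattern = build_split_pattern(array('Jr', 'Mr', 'Mrs', 'Ms', 'Dr', 'Prof', 'Sr'));
$lines = preg_split(
    $pattern,
    'I met Dr. Jones. He was with Mrs. Smith.',
    -1,
    PREG_SPLIT_NO_EMPTY
);
print_r($lines);
```

Because each alternative starts with `\s`, an abbreviation at the very beginning of the text has no whitespace in front of it and is not protected; "Dr. Jones left." would still be split after "Dr.".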
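And if you are processing HTML, watch out for tags being stripped in a way that eliminates the whitespace between sentences, such as <p></p>. If you end up with situations.Like this where.They stick together, parsing becomes much more difficult.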
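One way to avoid that, sketched under the assumption that an extra space around every tag is acceptable for your text: put a space in front of each tag before stripping, then collapse the surplus whitespace. The function name `html_to_spaced_text` is hypothetical.

```php
<?php
// Hypothetical helper: strip tags without gluing neighbouring sentences together.
function html_to_spaced_text(string $html): string
{
    // Put a space in front of every tag so that removing the tag leaves a gap.
    $text = strip_tags(str_replace('<', ' <', $html));
    // Collapse the extra whitespace this introduces and trim the ends.
    return trim(preg_replace('/\s+/', ' ', $text));
}

$html = '<p>First sentence.</p><p>Second one.</p>';

$glued = strip_tags($html);           // sentences stick together
$spaced = html_to_spaced_text($html); // sentences stay separated

echo $glued, "\n", $spaced, "\n";
```

The cost is that inline tags inside a word, such as un<b>believ</b>able, also pick up spaces, so this suits block-level markup better than heavily styled text.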
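Answer 5 (score: 0)
@ridgerunner I ported your PHP code to C#.
The result is 2 sentences:
The correct result should be the single sentence: Mr. J. Dujardin régle sa T.V.A. en esp. uniquement
And with our test paragraph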
string sText = "This is sentence one. Sentence two! Sentence three? Sentence \"four\". Sentence \"five\"! Sentence \"six\"? Sentence \"seven.\" Sentence 'eight!' Dr. Jones said: \"Mrs. Smith you have a lovely daughter!\" The T.V.A. is a big project!";
the result is
index: 0 sentence: This is sentence one.
index: 22 sentence: Sentence two!
index: 36 sentence: Sentence three?
index: 52 sentence: Sentence "four".
index: 69 sentence: Sentence "five"!
index: 86 sentence: Sentence "six"?
index: 102 sentence: Sentence "seven.
index: 118 sentence: " Sentence 'eight!'
index: 136 sentence: ' Dr. Jones said: "Mrs. Smith you have a lovely daughter!
index: 193 sentence: " The T.V.
index: 203 sentence: A. is a big project!
C# code:
string sText = "Mr. J. Dujardin régle sa T.V.A. en esp. uniquement";
Regex rx = new Regex(@"(\S.+?
[.!?] # Either an end of sentence punct,
| [.!?]['""] # or end of sentence punct and quote.
)
(?<! # Begin negative lookbehind.
Mr. # Skip either Mr.
| Mrs. # or Mrs.,
| Ms. # or Ms.,
| Jr. # or Jr.,
| Dr. # or Dr.,
| Prof. # or Prof.,
| Sr. # or Sr.,
| \s[A-Z]. # or initials ex: George W. Bush,
| T\.V\.A\. # or "T.V.A."
) # End negative lookbehind.
(?=|\s+|$)",
RegexOptions.CultureInvariant | RegexOptions.IgnorePatternWhitespace | RegexOptions.Compiled);
foreach (Match match in rx.Matches(sText))
{
Console.WriteLine("index: {0} sentence: {1}", match.Index, match.Value);
}
Answer 6 (score: -1)