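Splitting text into sentences with preg_split

Time: 2015-01-13 20:04:12

Tags: php regex

I have a complete script that splits text into sentences. In addition to punctuation, I also want to treat certain phrases as the end of a sentence. This works when the marker is a single character, but not when it contains spaces.

Here is my code: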

$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?:\#*]             # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n]              # or end of sentence punct and quote.
| HYPERLINK
| .org
| .gov
| .aspx
| .com
| Date
| Dear  
)                   # End positive lookbehind.
(?<!                # Begin negative lookbehind.
  Mr\.              # Skip either "Mr."
| Mrs\.             # or "Mrs.",    
| Ms\.              # or "Ms.",
| Jr\.              # or "Jr.",
| Dr\.              # or "Dr.",
| Prof\.            # or "Prof.",
| U\.S\.A\.
| U\.S\.
| Sr\.              # or "Sr.",
| T\.V\.A\.         # or "T.V.A.",
| a\.m\.            # or "a.m.",
| p\.m\.            # or "p.m.",
| a€¢\.
| :\.

                    # or... (you get the idea).
)                   # End negative lookbehind.
\s+                 # Split on whitespace between sentences.

/ix';

Here is an example phrase I am trying to add: "Total Gross Income"

I have tried writing it in the following ways, but none of them work:

$re = '/# Split sentences on whitespace between them.
(?<=                # Begin positive lookbehind.
  [.!?:\#*]             # Either an end of sentence punct,
| [.!?:][\'"]
| [\r\t\n]              # or end of sentence punct and quote.
| HYPERLINK
| .org
| .gov
| .aspx
| .com
| Date
| Dear  
| "Total Gross Income"
| Total[ X]Gross[ X]Income
| Total" "Gross" "Income
)  

For example, if I have the following code:

$block_o_text = "You could receive the wrong amount. If you receive more benefits than you    should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross Income Total ResourcesMedical ProgramsHousehold.";

$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_NO_EMPTY);

for ($i = 0; $i < count($sentences); ++$i) {
    echo $i . " - " . $sentences[$i] . "<BR>";
}

The result I get is:

77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income Total ResourcesMedical ProgramsHousehold 

What I want to get is:

77 - You could receive the wrong amount.
78 - If you receive more benefits than you should, you must pay them back.
79 - When will we review your case?
80 - An eligibility review form will be sent before your benefits stop.
81 - 01/201502/2015
82 - Total Gross Income
83 - Total ResourcesMedical ProgramsHousehold 

What am I doing wrong?

1 answer:

Answer 0 (score: 1)

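Your problem is the whitespace pattern (\s+) that follows your lookbehind: it requires at least one whitespace character in order to split, but if you remove it, you end up capturing the preceding letters and breaking the whole thing.

So, as far as I can tell, you can't do this entirely with lookarounds. You still need lookarounds for some of the expressions (whitespace preceded by punctuation, and so on), but for specific phrases you can't rely on them.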
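
To see the whitespace requirement in isolation, here is a minimal sketch. The pattern and the two sample strings are reduced versions of the ones in the question, chosen only for illustration; the variable names are not from the original code.

```php
<?php
// Lookbehind for the phrase, followed by mandatory whitespace (\s+),
// mirroring the structure of the pattern in the question.
$re = '/(?<=Total\sGross\sIncome)\s+/i';

// Phrase followed by a space: there is whitespace to split on.
$withSpace = preg_split($re, 'Total Gross Income Total Resources');

// Phrase fused to the next word: \s+ has nothing to match, so no split.
$fused = preg_split($re, 'Total Gross IncomeTotal Resources');

print_r($withSpace); // two elements
print_r($fused);     // one element: the whole string, unsplit
```

The lookbehind itself succeeds in both cases; only the `\s+` that follows it decides whether a split happens.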

You can also use the PREG_SPLIT_DELIM_CAPTURE flag to capture what you are splitting on. Something like this should get you started:

$re = '/((?<=[\.\?\!])\s+|Total\sGross\sIncome)/ix';

$block_o_text = "You could receive the wrong amount. If you receive more benefits than you    should, you must pay them back. When will we review your case? An eligibility review form will be sent before your benefits stop. Total Gross IncomeTotal ResourcesMedical ProgramsHousehold.";

$sentences = preg_split($re, $block_o_text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

for ($i = 0; $i < count($sentences); ++$i) {
    if (!ctype_space($sentences[$i])) {
        echo $i . " - " . $sentences[$i] . "<br>";
    }
}

Output:

0 - You could receive the wrong amount.
2 - If you receive more benefits than you should, you must pay them back.
4 - When will we review your case?
6 - An eligibility review form will be sent before your benefits stop.
8 - Total Gross Income
9 - Total ResourcesMedical ProgramsHousehold.
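
A possible variation, which is not part of the original answer: keep the phrase inside a lookbehind but make the whitespace after it optional (`\s*`), so the split can happen at a zero-length position right after the phrase. This sketch assumes that a zero-length split point is acceptable for your data. The helper name `splitSentences` and the shortened sample text are hypothetical.

```php
<?php
// Hypothetical helper, not from the original answer.
function splitSentences(string $text): array
{
    $re = '/
        (?<=[.!?])\s+                  # whitespace after sentence punctuation
      | (?<=Total\sGross\sIncome)\s*   # optional whitespace right after the phrase
    /ix';

    $parts = preg_split($re, $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on failure; fall back to an empty list.
    return $parts === false ? [] : $parts;
}

// Shortened sample text, with the phrase fused to the next word.
$sample = 'Benefits stop. Total Gross IncomeTotal Resources.';

foreach (splitSentences($sample) as $i => $sentence) {
    echo $i . " - " . $sentence . "<br>";
}
```

Because no delimiters are captured, the pieces are numbered consecutively and there are no whitespace-only entries to filter out. One caveat: if the phrase is followed directly by punctuation, that punctuation may end up as a piece of its own.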