Parsing email headers from text using a PCRE regular expression

Time: 2013-10-29 11:40:39

Tags: php regex preg-match preg-split

I need to parse (split) a text file containing emails exported from Outlook. I use preg_split with the PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE flags to split it.
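For readers unfamiliar with the two flags, here is a minimal sketch of what they do. The sample string and the simplified pattern are assumptions made for illustration; they are not the pattern from this question.

```php
<?php
// Made-up sample: two tiny "emails", each with a header block and a body.
$text = "From: A\nTo: B\n\nBody 1\nFrom: C\nTo: D\n\nBody 2\n";

// Simplified, hypothetical header pattern: a "From:" line followed by
// any number of lines that start with a non-whitespace character.
$pattern = '/(^From:.*\R(?:\S.*\R)*)/m';

// PREG_SPLIT_DELIM_CAPTURE keeps the captured header blocks in the result;
// PREG_SPLIT_NO_EMPTY drops the empty piece before the first header.
$parts = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

print_r($parts);
```

With these flags the result alternates between header blocks and the text that follows them, which is why the header pattern has to stop exactly at the blank line.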

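My goal is to capture the email header section with a regular expression, i.e. everything from the "From:" line up to the blank line that precedes the message body.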

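Constraints:

  • Field names may appear in multiple languages
  • The number of header fields varies (CC, BCC, Attachments)
  • Some fields may span multiple lines (To, CC, BCC, Subject, Attachments)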

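Preprocessing of the text file: multiple spaces and tabs are replaced with a single space, and leading and trailing whitespace is trimmed.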
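The preprocessing step might look roughly like this. The function name normalize_whitespace is hypothetical, and trimming every line (rather than only the text as a whole) is an assumption about what was intended.

```php
<?php
// Hypothetical helper; the name and the per-line trimming are assumptions.
function normalize_whitespace(string $text): string
{
    // Collapse every run of spaces and tabs into a single space.
    $text = preg_replace('/[ \t]+/', ' ', $text);

    // Remove the one remaining space at the start or end of each line.
    // The lookahead tolerates a "\r" before the line feed (CRLF files).
    return preg_replace('/^ | (?=\r?$)/m', '', $text);
}

echo normalize_whitespace("From:\t\tA   B \r\n  To: C\t\n");
```

Line breaks are left untouched, so blank lines between header and body survive this step.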

I have been at this all day and cannot get the last part to work. It works on the gskinner regex test page (http://regexr.com?36v27), but not in PHP.

Subject text:

From: Black, Jack (LA)
Sent: Monday, October 28, 2013 6:36 PM
To: George, Jackson (London); DCS.CC.DARWIN (Australia)
Cc: Bar, Foo (Istanbul); Ex, Reg (Istanbul); Smith, John (Istanbul); Rambo,
John J. (Gaziantep); Matrix, John (Phuket)
Subject: RE: PREVENTIVE AND CORRECTIVE ACTIONS / FOOBAR

Dear George,

venenatis imperdiet quam. Proin a egestas nunc, et mattis elit. In hac habitasse platea dictumst. Nulla dolor nibh, tempus ut neque eu, tempus fermentum mauris. Mauris nec ipsum nec sapien commodo scelerisque ut eu urna. Pellentesque eu neque in enim adipiscing faucibus. Sed interdum arcu et sem mollis iaculis. Duis euismod laoreet ligula lacinia dapibus. Vestibulum ullamcorper malesuada metus at malesuada. 

 Nullam enim elit, auctor vehicula orci eget, imperdiet feugiat odio. Etiam dapibus sagittis sem a varius. Nulla sit amet convallis mi, sit amet rutrum ipsum. In libero lectus, mattis at dui eu.

Thank you and best regards,

Jack B. Black (Mr)
Operations Manager (GGD)
FU Supervisor (R34, R57)

Phone: +1112212212 (local 1111)
Mobile: +12 121.111.11.12

From: George, Jackson (UK)
Sent: Monday, October 28, 2013 5:57 PM
To: DCS.CC.DARWIN (Australia)
Bar, Foo (Istanbul); Ex, Reg (Istanbul); Smith, John (Istanbul); Rambo,
John J. (Gaziantep); Matrix, John (Phuket)
Subject: PREVENTIVE AND CORRECTIVE ACTIONS / FOOBAR

Dear Colleagues,

ermentum. Duis ipsum quam, bibendum a risus nec, tincidunt fringilla lectus. Nunc vel dictum massa, et cursus nunc. Mauris tincidunt felis eget justo congue volutpat. Nulla condimentum accumsan elementum. Integer commodo, lorem eu pharetra suscipit, ligula.

Best Regards.

SDFD srfgGD
Field coordinator (GGD)
Customer Representative

sds dfsd sdfgsef sdfsd
sgzdfgdfg fgfg gdfg
Footer text etc
sdfdfdf dfgsdfgsdfgsdfg
Phone : +90 212 368 40 00 (ext:3814)

Regex:

preg_match(
    '/                              # delimiter
    (                               # capturing group start
    [\ A-Z][a-z]+:.+\(.+\)\R        # From: field
    [A-Z][a-z]+:.+\R                # Sent: field
    [A-Z][a-z]+:.+\R                # To: field (1st line)
    (?:.+\R)+                       # any additional header lines before the blank line (To, CC, BCC, Subject, Attachments)
    )                               # capturing group end
    # delimiter + modifiers /x', $text_clean, $matches);
echo '<b>Matches: ' . count($matches) . '</b>';
print_r($matches);

I am having trouble capturing the additional header lines:

(?:.+\R)+              # any additional header lines...

Any help is appreciated.

2 answers:

Answer 0 (score: 1)

The simplest way is to use preg_match_all with a lazy quantifier:

preg_match_all('/^From.*?\R\R/ims', $mails, $matches);
print_r($matches);
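As a quick usage sketch, here is the same pattern applied to a shortened, made-up sample (the contents of $mails are an assumption). The lazy `.*?` together with the `s` modifier runs from each line starting with "From" to the first pair of consecutive line breaks, i.e. the blank line after the header.

```php
<?php
// Made-up, shortened sample with two messages.
$mails = "From: A\nSent: Monday\nTo: B\nSubject: Hi\n\nBody 1\n\n"
       . "From: C\nSent: Monday\nTo: D;\nE\nSubject: Hello\n\nBody 2\n";

// m: ^ matches at every line start; s: the dot also matches line breaks;
// i: case-insensitive matching.
preg_match_all('/^From.*?\R\R/ims', $mails, $matches);

// $matches[0] holds one header block per message.
print_r($matches[0]);
```

Note that with the `i` modifier any body line that begins with "from" would also start a match, so this approach relies on the body not containing such lines.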

Answer 1 (score: 0)

Thanks everyone for the input, but I worked it out with my own approach. A few points still puzzle me; the working solution is below.

  1. Why does preg_match return only the first result instead of both matches? (http://www.ideone.com/Xj6aaF1)

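  2. In (?:.+\R)+ the dot seems to match any character and also no character, which is why it keeps missing the blank line. I find that strange: isn't + supposed to be the "1 or more" quantifier?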

  3. Anyway, when I changed that part of my pattern to (?:\S.+\R)+, it does what I want with preg_split.

    Demo

    Although my problem is technically solved, I would appreciate it if someone could explain the first two points above.
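A possible explanation for points 1 and 2, offered as a sketch rather than a confirmed diagnosis. It assumes the exported file uses Windows (CRLF) line endings and that PCRE runs with its usual LF newline convention. Under those assumptions the dot excludes only "\n", so on a blank line `.+` can match the lone "\r" and `\R` then matches the "\n". The repetition therefore continues across blank lines, whereas `\S` refuses the "\r" and stops there. Separately, preg_match always stops after the first match; preg_match_all is the function that returns every match. The sample text below is made up.

```php
<?php
// Assumption: CRLF line endings, so a "blank" line is really "\r\n".
$text = "From: A (X)\r\nSent: Mon\r\nTo: B\r\nSubject: Hi\r\n\r\nBody\r\n";

// The dot can match "\r", so the blank line still satisfies .+\R
// and the match continues to the end of the text.
preg_match('/(?:.+\R)+/', $text, $greedy);

// \S cannot match "\r", so the repetition ends at the blank line.
preg_match('/(?:\S.+\R)+/', $text, $strict);

// preg_match reports at most one match; preg_match_all finds them all.
$first = preg_match('/^From:.*\R/m', $text . $text);
$count = preg_match_all('/^From:.*\R/m', $text . $text, $all);

var_dump($greedy[0] === $text, $strict[0], $first, $count);
```

If the file turns out to use plain LF line endings, this explanation does not apply and the cause lies elsewhere.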