Trying to extract plain text from a MIME email body

Asked: 2011-11-21 15:55:01

Tags: php regex email mime plaintext

Question:

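I am working on an email system. We receive emails and store them in a MySQL database. The body is parsed, the headers are stripped, and so on. But when we receive a MIME-formatted email, the body data ends up in the database like this: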

This is a multi-part message in MIME format.

------=_NextPart_000_1B20_01CCA865.03078710
Content-Type: text/plain;
charset=\"us-ascii\"
Content-Transfer-Encoding: 7bit

This Message is intended for the indicated recipients only and may be
confidential. If this message has been sent to you in error you must take no
action based on it, nor must you copy or show it to anyone; please inform us
immediately and delete this message.




------=_NextPart_000_1B20_01CCA865.03078710
Content-Type: text/html;
charset=\"us-ascii\"
Content-Transfer-Encoding: quoted-printable

<html xmlns:v=3D\"urn:schemas-microsoft-com:vml\" =
xmlns:o=3D\"urn:schemas-microsoft-com:office:office\" =
xmlns:w=3D\"urn:schemas-microsoft-com:office:word\" =
xmlns:m=3D\"http://schemas.microsoft.com/office/2004/12/omml\" =
xmlns=3D\"http://www.w3.org/TR/REC-html40\"><head><META =
HTTP-EQUIV=3D\"Content-Type\" CONTENT=3D\"text/html; =
charset=3Dus-ascii\"><meta name=3DGenerator content=3D\"Microsoft Word 12 =
(filtered medium)\"><style><!--
/* Font Definitions */
@font-face
{font-family:\"Cambria Math\";
panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
{font-family:Calibri;
panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
{font-family:Tahoma;
panose-1:2 11 6 4 3 5 4 4 2 4;}
@font-face
{font-family:Verdana;
panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
{margin:0cm;
margin-bottom:.0001pt;
font-size:11.0pt;
font-family:\"Calibri\",\"sans-serif\";}
a:link, span.MsoHyperlink
{mso-style-priority:99;
color:blue;
text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
{mso-style-priority:99;
color:purple;
text-decoration:underline;}
p.MsoAcetate, li.MsoAcetate, div.MsoAcetate
{mso-style-priority:99;
mso-style-link:\"Balloon Text Char\";
margin:0cm;
margin-bottom:.0001pt;
font-size:8.0pt;
font-family:\"Tahoma\",\"sans-serif\";}
span.EmailStyle17
{mso-style-type:personal-compose;
font-family:\"Calibri\",\"sans-serif\";
color:windowtext;}
span.BalloonTextChar
{mso-style-name:\"Balloon Text Char\";
mso-style-priority:99;
mso-style-link:\"Balloon Text\";
font-family:\"Tahoma\",\"sans-serif\";}
..MsoChpDefault
{mso-style-type:export-only;}
@page WordSection1
{size:612.0pt 792.0pt;
margin:72.0pt 72.0pt 72.0pt 72.0pt;}
div.WordSection1
{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext=3D\"edit\" spidmax=3D\"1026\" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext=3D\"edit\">
<o:idmap v:ext=3D\"edit\" data=3D\"1\" />
</o:shapelayout></xml><![endif]--></head><body lang=3DEN-GB link=3Dblue =
vlink=3Dpurple><div class=3DWordSection1><p class=3DMsoNormal><span =
style=3D\'font-size:7.5pt;font-family:\"Verdana\",\"sans-serif\";color:#1F497D=
\'>This Message is intended for the indicated recipients only and may be =
confidential. If this message has been sent to you in error you must =
take no action based on it, nor must you copy or show it to anyone; =
please inform us immediately and delete this message. </span><span =
style=3D\'color:#1F497D\'><o:p></o:p></span></p><p =
class=3DMsoNormal><o:p>&nbsp;</o:p></p></div></body></html>
------=_NextPart_000_1B20_01CCA865.03078710--

.

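We want to strip everything except the plain-text version. Are there any regex experts out there who can solve this? We have tried several classes and other PHP systems, but they always return the same code that was fed in rather than the text we are after. Any ideas? RegEx preferred. We are thinking of detecting text/plain and a run of line breaks to locate the plain-text content....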

1 answer:

Answer 0 (score: 1)

"RegEx preferred"

No, they are not.

Use a proper MIME parser (Google turns up this one; I can't comment on its quality).
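To illustrate what "parsing rather than regexing" might look like for the stored body in the question, here is a minimal sketch in plain PHP. It is not the parser linked above. The function name extractPlainText is hypothetical, and the sketch rests on several assumptions: the top-level headers are already gone, so the boundary is inferred from the first line starting with "--"; the message is a single-level multipart (no nested multiparts); and no charset conversion is needed. It also tolerates header continuation lines that have lost their indentation, as in the sample in the question.

```php
<?php
// Hypothetical helper (not from the original thread): returns the decoded
// text/plain part of a stored multipart body, or null if none is found.
function extractPlainText(string $body): ?string
{
    // Normalise line endings so the splitting below only deals with "\n".
    $body = str_replace(["\r\n", "\r"], "\n", $body);

    // Assumption: the outer Content-Type header was stripped, so the boundary
    // is taken from the first line that starts with "--".
    if (!preg_match('/^--(\S.*?)[ \t]*$/m', $body, $m)) {
        return null;
    }
    $boundary = $m[1];

    // Split on delimiter lines; the first chunk is the preamble.
    $chunks = explode("\n--" . $boundary, "\n" . $body);
    array_shift($chunks);

    foreach ($chunks as $chunk) {
        if (strncmp($chunk, '--', 2) === 0) {
            break; // closing delimiter "--boundary--"
        }

        // Part headers end at the first blank line.
        [$head, $content] = array_pad(explode("\n\n", $chunk, 2), 2, '');

        $headers = [];
        $name = null;
        foreach (explode("\n", trim($head)) as $line) {
            if (preg_match('/^([A-Za-z0-9-]+):\s*(.*)$/', $line, $h)) {
                $name = strtolower($h[1]);
                $headers[$name] = $h[2];
            } elseif ($name !== null) {
                // Folded continuation line (indentation may have been lost).
                $headers[$name] .= ' ' . trim($line);
            }
        }

        // A part without a Content-Type defaults to text/plain.
        $type = strtolower($headers['content-type'] ?? 'text/plain');
        if (strpos($type, 'text/plain') !== 0) {
            continue;
        }

        $encoding = strtolower(trim($headers['content-transfer-encoding'] ?? '7bit'));
        if ($encoding === 'quoted-printable') {
            $content = quoted_printable_decode($content);
        } elseif ($encoding === 'base64') {
            $decoded = base64_decode($content);
            if ($decoded === false) {
                return null;
            }
            $content = $decoded;
        }

        // Trailing blank lines before the next delimiter are dropped.
        return rtrim($content);
    }

    return null;
}

// Usage with a small made-up message (not the one from the question):
$stored = "This is a multi-part message in MIME format.\n\n"
        . "--b1\nContent-Type: text/plain;\n\tcharset=\"us-ascii\"\n"
        . "Content-Transfer-Encoding: 7bit\n\nHello, plain text.\n\n"
        . "--b1\nContent-Type: text/html\n"
        . "Content-Transfer-Encoding: quoted-printable\n\n"
        . "<p>Hello, =\nHTML.</p>\n--b1--\n";

echo extractPlainText($stored), "\n";
```

Even this small sketch has to deal with boundaries, header folding, and transfer encodings, which is why a single regular expression tends to break on real mail. For nested multiparts, attachments, or non-ASCII charsets, a maintained MIME parsing library is likely the safer choice.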