R

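Time: 2015-11-06 09:19:03

Tags: r email parsing

I receive about 55,000 emails a month, and I have been given the task of analyzing them. An .eml file contains a lot of material, but I am usually interested only in the text content of the email, as shown below: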

From: "SavReader" <info@savreader.com>
To: <pgmagesh@gmail.com>
Subject: Export file SavReader.com
Date: Mon, 2 Nov 2015 08:37:52 +0100
MIME-Version: 1.0
Content-Type: multipart/alternative;
    boundary="----=_NextPart_000_0000_01D11549.C39BD260"

This is a multi-part message in MIME format.

------=_NextPart_000_0000_01D11549.C39BD260
Content-Type: text/plain;
    charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

Hello from SavReader!

The file that you submitted for export is now available for download
from SavReader - all files will be stored for 1 day from receipt of
this mail.

Download file <http://www.savreader.com/export/qlysDuv1xQ.xls> 

Thanks, 

Team SavReader

------=_NextPart_000_0000_01D11549.C39BD260
Content-Type: text/html
Content-Transfer-Encoding: 7bit

Hello from SavReader!<br><br>The file that you submitted for export is now available for download from SavReader - all files will be stored for 1 day from receipt of this mail.<br><br><a href=http://www.savreader.com/export/qlysDuv1xQ.xls>Download file</a><br><br>Thanks, <br><br>Team SavReader
------=_NextPart_000_0000_01D11549.C39BD260--

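I am interested in extracting the Subject:, the From:, and the body of the message. The body appears under both Content-Type: text/plain; charset=UTF-8 and Content-Type: text/html; charset=UTF-8, and I think it is sandwiched between a matching pair of delimiter strings: --001a113d7c1e5de339051fdaaf69 before it and --001a113d7c1e5de339051fdaaf69-- after it (the closing delimiter ends with two additional "-" characters). I tried to parse out the email ID and the content of the mail. I used the following code, where a is the name of my .eml file: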

pat = '([From]): ([a-zA-Z]) (([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,6}))'
d <- str_match(pattern = pat, a)
d

Another option:

strsplit(gsub("(?s)^_+\\s+", "", a, perl=T) , "_+\\s*(?=From:)", perl=T)[[1]]

Another option:

d <- str_extract(string=a,pattern="From:\\b[-A-Za-z0-9_.%]+\\@[-A-Za-z0-9_.%]+\\.[A-Za-z]+")

as well as many other options given on SO. What I want to extract is:

From: DEFHIJ <abc@xyz.in>

and the HTML content of the mail between the matching delimiter strings. Can anyone help?

1 answer:

Answer 0 (score: 0)

You're almost there. You need to include the angle brackets, and use \\s* to match the whitespace between From: and the first letter.

str_extract(string=a,pattern="^From:\\s*[-A-Za-z0-9_.%]+\\s*<[-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+>")
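To see what this pattern does and does not match, here is a minimal sketch. It assumes the file has been read into a character vector of lines (as readLines() would return), and it uses base R's regexpr()/regmatches() in place of str_extract() so that it runs without extra packages. The vector eml_lines and its contents are made up for illustration.

```r
# Hypothetical header lines, one element per line of the .eml file.
eml_lines <- c(
  "From: DEFHIJ <abc@xyz.in>",
  "From: \"SavReader\" <info@savreader.com>",
  "Subject: Export file SavReader.com"
)

# Same pattern as in the answer above.
from_pat <- "^From:\\s*[-A-Za-z0-9_.%]+\\s*<[-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+>"

# regmatches() keeps only the elements where the pattern matched.
from_header <- regmatches(eml_lines, regexpr(from_pat, eml_lines, perl = TRUE))
print(from_header)
```

Only the first line is returned. The second line is skipped because the double quote around the display name is not in the character class, so a quoted name like the one in the sample message needs a looser pattern.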


str_extract(a, "^From:.*")
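The question also asks for the Subject: and the HTML part between the delimiter strings. The following is only a rough sketch of one way to get them with base R, not a full MIME parser: it assumes a single-level multipart message read as a vector of lines, unfolded From:/Subject: headers, and a 7bit body (no base64 or quoted-printable decoding). The helpers get_header() and get_part() and the shortened sample message are hypothetical.

```r
# Shortened, made-up version of the sample message, one element per line.
eml_lines <- c(
  'From: "SavReader" <info@savreader.com>',
  'Subject: Export file SavReader.com',
  'Content-Type: multipart/alternative;',
  '    boundary="----=_NextPart_000"',
  '',
  '------=_NextPart_000',
  'Content-Type: text/plain;',
  '    charset="iso-8859-1"',
  '',
  'Hello from SavReader!',
  '',
  '------=_NextPart_000',
  'Content-Type: text/html',
  'Content-Transfer-Encoding: 7bit',
  '',
  'Hello from SavReader!<br><br>Thanks',
  '------=_NextPart_000--'
)

# Value of the first header line that starts with "<field>:".
get_header <- function(lines, field) {
  hit <- grep(paste0("^", field, ":"), lines, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  sub(paste0("^", field, ":\\s*"), "", hit[1])
}

# Body of the first part whose Content-Type starts with `type`.
get_part <- function(lines, type) {
  # The boundary is declared in the top-level Content-Type header.
  boundary_line <- grep('boundary="', lines, value = TRUE)[1]
  boundary <- sub('.*boundary="([^"]+)".*', "\\1", boundary_line)
  delim <- paste0("--", boundary)
  # Delimiter lines are "--boundary"; the closing one is "--boundary--".
  delim_idx <- which(lines == delim | lines == paste0(delim, "--"))
  for (i in seq_len(max(length(delim_idx) - 1, 0))) {
    part <- lines[(delim_idx[i] + 1):(delim_idx[i + 1] - 1)]
    blank <- which(part == "")[1]  # blank line ends the part headers
    if (is.na(blank)) next
    part_headers <- part[seq_len(blank - 1)]
    is_type <- grepl(paste0("^Content-Type:\\s*", type), part_headers,
                     ignore.case = TRUE)
    if (any(is_type)) return(paste(part[-seq_len(blank)], collapse = "\n"))
  }
  NA_character_
}

from_value <- get_header(eml_lines, "From")
subject    <- get_header(eml_lines, "Subject")
address    <- sub(".*<([^>]+)>.*", "\\1", from_value)
html_body  <- get_part(eml_lines, "text/html")
```

Here the boundary is read from the message itself rather than hard-coded, since it differs from one email to the next.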