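I receive about 55,000 emails a month, and I have been tasked with analyzing them. An .eml file contains a lot of material, but I am mainly interested in the text content of each email, which looks like this: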
From: "SavReader" <info@savreader.com>
To: <pgmagesh@gmail.com>
Subject: Export file SavReader.com
Date: Mon, 2 Nov 2015 08:37:52 +0100
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="----=_NextPart_000_0000_01D11549.C39BD260"
This is a multi-part message in MIME format.
------=_NextPart_000_0000_01D11549.C39BD260
Content-Type: text/plain;
charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Hello from SavReader!
The file that you submitted for export is now available for download
from SavReader - all files will be stored for 1 day from receipt of
this mail.
Download file <http://www.savreader.com/export/qlysDuv1xQ.xls>
Thanks,
Team SavReader
------=_NextPart_000_0000_01D11549.C39BD260
Content-Type: text/html
Content-Transfer-Encoding: 7bit
Hello from SavReader!<br><br>The file that you submitted for export is now available for download from SavReader - all files will be stored for 1 day from receipt of this mail.<br><br><a href=http://www.savreader.com/export/qlysDuv1xQ.xls>Download file</a><br><br>Thanks, <br><br>Team SavReader
------=_NextPart_000_0000_01D11549.C39BD260--
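I am interested in extracting the Subject:, the From: and the content of the message. The message body appears under both Content-Type: text/plain; charset=UTF-8 and Content-Type: text/html; charset=UTF-8, and in each case it is enclosed by a matching pair of delimiter strings: --001a113d7c1e5de339051fdaaf69 before it and --001a113d7c1e5de339051fdaaf69-- after it (the closing delimiter ends with two extra "-" characters). In other words, the message body is sandwiched between these ids. I am trying to parse the sender's email address and the content of the mail. I used the following code, where a is the name of my .eml file: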
pat = '([From]): ([a-zA-Z]) (([a-z0-9_\\.-]+)@([\\da-z\\.-]+)\\.([a-z\\.]{2,6}))'
d <- str_match(pattern = pat, a)
d
Another option:
strsplit(gsub("(?s)^_+\\s+", "", a, perl=T) , "_+\\s*(?=From:)", perl=T)[[1]]
Another option:
d <- str_extract(string=a,pattern="From:\\b[-A-Za-z0-9_.%]+\\@[-A-Za-z0-9_.%]+\\.[A-Za-z]+")
as well as many other options suggested on SO. What I want to extract is:
From: DEFHIJ <abc@xyz.in>
and the HTML content of the mail that sits between the matching delimiter strings. Can anyone help?
Answer 0: (score: 0)
You are almost there. You need to include the angle brackets, and use \\s* to match the whitespace between From: and the first letter.
str_extract(string=a,pattern="^From:\\s*[-A-Za-z0-9_.%]+\\s*<[-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+>")
or
str_extract(a, "^From:.*")
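The patterns above only match if a holds the text of the message, one element per line (for example the result of readLines()), rather than the file name, because ^ anchors at the start of each string. Below is a minimal sketch of that idea in base R, extended to pull out the HTML part between the boundary delimiters. The helper names get_header and get_part and the inline sample are hypothetical, not part of any package. The sketch assumes the boundary is given in double quotes, header names are capitalized as in the sample, and headers are not folded across lines.

```r
# Hypothetical sample standing in for readLines("message.eml");
# the file name is an assumption, not taken from the question.
eml_lines <- c(
  'From: "SavReader" <info@savreader.com>',
  'To: <pgmagesh@gmail.com>',
  'Subject: Export file SavReader.com',
  'MIME-Version: 1.0',
  'Content-Type: multipart/alternative;',
  '\tboundary="----=_NextPart_000_0000_01D11549.C39BD260"',
  '',
  'This is a multi-part message in MIME format.',
  '',
  '------=_NextPart_000_0000_01D11549.C39BD260',
  'Content-Type: text/plain;',
  '\tcharset="iso-8859-1"',
  'Content-Transfer-Encoding: 7bit',
  '',
  'Hello from SavReader!',
  '',
  '------=_NextPart_000_0000_01D11549.C39BD260',
  'Content-Type: text/html',
  'Content-Transfer-Encoding: 7bit',
  '',
  'Hello from SavReader!<br><br>Thanks, <br><br>Team SavReader',
  '',
  '------=_NextPart_000_0000_01D11549.C39BD260--'
)

# Return the first line that starts with "<field>:", or NA if there is none.
get_header <- function(lines, field) {
  hits <- grep(paste0("^", field, ":"), lines, value = TRUE)
  if (length(hits) == 0) NA_character_ else hits[1]
}

# Return the non-empty body lines of the first MIME part whose
# Content-Type header contains `type`.
get_part <- function(lines, type = "text/html") {
  # Read the boundary token from the boundary="..." header parameter.
  b_line <- grep('boundary="', lines, value = TRUE, fixed = TRUE)[1]
  boundary <- sub('.*boundary="([^"]+)".*', "\\1", b_line)
  # Delimiter lines are "--" + boundary; the closing one adds a trailing "--".
  delim <- which(lines == paste0("--", boundary) |
                 lines == paste0("--", boundary, "--"))
  for (i in seq_len(max(length(delim) - 1, 0))) {
    if (delim[i + 1] - delim[i] < 2) next    # nothing between the delimiters
    part <- lines[(delim[i] + 1):(delim[i + 1] - 1)]
    blank <- which(part == "")[1]            # blank line ends the part headers
    if (is.na(blank)) next
    headers <- part[seq_len(blank - 1)]
    if (any(grepl(paste0("Content-Type: ", type), headers, fixed = TRUE))) {
      body <- part[-seq_len(blank)]
      return(body[body != ""])
    }
  }
  character(0)
}

from_line <- get_header(eml_lines, "From")
subject   <- get_header(eml_lines, "Subject")
address   <- sub(".*<([^>]+)>.*", "\\1", from_line)  # text inside < >
html_body <- get_part(eml_lines, "text/html")
```

With a real file, eml_lines would come from readLines() on the .eml path, and get_part(eml_lines, "text/plain") would return the plain-text part instead. For 55,000 messages a month, a dedicated MIME parser is likely to be more robust than line matching, since real mail can use folded headers, unquoted boundaries, nested multiparts and base64 or quoted-printable bodies, none of which this sketch handles.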