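When a message is saved from Microsoft Outlook, it is stored as a ".msg" file that contains the full content of the email along with its attached files. I want to extract the text of the email body and its attachments. Does Apache Tika support ".msg" files? If not, are there any other ideas?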
Answer 0 (score: 1)
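If you look at the list of mail formats supported by Apache Tika 1.9 (the latest version at the time of writing), you will see that Outlook MSG files are listed as supported.
Taking a simple sample MSG file from the Apache POI project's test files, and using the standalone Tika App jar to keep testing simple, we can easily get both the content and the metadata: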
$ java -jar tika-app-1.9.jar --metadata simple_test_msg.msg
Author: Travis Ferguson
Content-Length: 16896
Content-Type: application/vnd.ms-outlook
Creation-Date: 2007-07-06T05:27:17Z
Last-Modified: 2007-07-06T05:27:17Z
Last-Save-Date: 2007-07-06T05:27:17Z
Message-Bcc:
Message-Cc:
Message-From: Travis Ferguson
Message-Recipient-Address: travis@overwrittenstack.com
Message-To: travis@overwrittenstack.com
X-Parsed-By: org.apache.tika.parser.DefaultParser
X-Parsed-By: org.apache.tika.parser.microsoft.OfficeParser
creator: Travis Ferguson
date: 2007-07-06T05:27:17Z
dc:creator: Travis Ferguson
dc:description: test message
dc:title: test message
dcterms:created: 2007-07-06T05:27:17Z
dcterms:modified: 2007-07-06T05:27:17Z
meta:author: Travis Ferguson
meta:creation-date: 2007-07-06T05:27:17Z
meta:save-date: 2007-07-06T05:27:17Z
modified: 2007-07-06T05:27:17Z
resourceName: simple_test_msg.msg
subject: test message
title: test message
$ java -jar tika-app-1.9.jar --text simple_test_msg.msg
test message
From
Travis Ferguson
To
travis@overwrittenstack.com
Recipients
travis@overwrittenstack.com
This is a test message.
Metadata (sender, recipients, dates and so on) plus the text: everything you asked for!
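If the Tika App is driven from another program, the `--metadata` listing above can be post-processed with plain JDK code. The following is only a sketch, assuming the output keeps the `Name: value` shape printed above, with names that may contain colons (`dc:creator`), may repeat (`X-Parsed-By`), and may have an empty value (`Message-Bcc:`). The class `TikaMetadataLines` and its `parse` method are hypothetical helpers and not part of Tika.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Tika): turns "Name: value" lines into a multimap.
final class TikaMetadataLines {

    static Map<String, List<String>> parse(String output) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String rawLine : output.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                continue;
            }
            String key;
            String value;
            // Split on the first ": " so that names such as "dc:creator"
            // and values such as "2007-07-06T05:27:17Z" stay intact.
            int sep = line.indexOf(": ");
            if (sep >= 0) {
                key = line.substring(0, sep);
                value = line.substring(sep + 2).trim();
            } else if (line.endsWith(":")) {
                // A name with an empty value, e.g. "Message-Bcc:".
                key = line.substring(0, line.length() - 1);
                value = "";
            } else {
                continue; // not a metadata line under the assumed format
            }
            // Repeated names (e.g. X-Parsed-By) accumulate in order.
            result.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "Author: Travis Ferguson\n"
                + "Message-Bcc: \n"
                + "X-Parsed-By: org.apache.tika.parser.DefaultParser\n"
                + "X-Parsed-By: org.apache.tika.parser.microsoft.OfficeParser\n"
                + "dc:creator: Travis Ferguson\n";
        System.out.println(parse(sample));
    }
}
```

Keeping a list per name preserves both `X-Parsed-By` entries instead of letting the second overwrite the first.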
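Alternatively, if you have special needs or requirements and want full control, you can use the underlying Apache POI HSMF library to parse MSG files directly; see the HSMF unit tests for usage examples.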
Answer 1 (score: -1)
Tika supports msg files.
You can also use Apache POI; there are some examples around, such as this one.
Sample:
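// The imports below assume the auxilii msgparser library, which the
// MsgParser / Message / FileAttachment names appear to come from.
import java.util.List;
import com.auxilii.msgparser.Message;
import com.auxilii.msgparser.MsgParser;
import com.auxilii.msgparser.attachment.Attachment;
import com.auxilii.msgparser.attachment.FileAttachment;

public class MsgSample {
    public static void main(String[] args) throws Exception {
        // Parse the .msg file into a Message object
        MsgParser msgp = new MsgParser();
        Message msg = msgp.parseMsg("c:/temp/test2.msg");
        String fromEmail = msg.getFromEmail();
        String fromName = msg.getFromName();
        String subject = msg.getSubject();
        String body = msg.getBodyText();
        System.out.println("From :" + fromName + " <" + fromEmail + ">");
        System.out.println("Subject :" + subject);
        System.out.println("");
        System.out.println(body);
        System.out.println("");
        // Typed list, so the for-each loop over Attachment compiles
        List<Attachment> atts = msg.getAttachments();
        for (Attachment att : atts) {
            if (att instanceof FileAttachment) {
                FileAttachment file = (FileAttachment) att;
                System.out.println("Attachment : " + file.getFilename());
                // you get the actual attachment with
                // byte[] data = file.getData();
            }
        }
    }
}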
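Once the file name and bytes of an attachment are available (`file.getFilename()` and `file.getData()` in the sample above), writing them to disk needs only the JDK. The following is a minimal sketch; `AttachmentSaver`, its `save` method, and the `attachment.bin` fallback name are hypothetical and not part of any of the libraries mentioned. It reduces the name to its last path segment, on the assumption that file names stored inside a message are untrusted input.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: stores attachment bytes under a target directory.
final class AttachmentSaver {

    static Path save(Path targetDir, String fileName, byte[] data) throws IOException {
        // Keep only the last path segment so a name like "..\..\x.txt"
        // cannot escape the target directory.
        String name = (fileName == null) ? "" : fileName.replace('\\', '/');
        name = name.substring(name.lastIndexOf('/') + 1).trim();
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            name = "attachment.bin"; // assumed fallback name
        }
        Files.createDirectories(targetDir);
        // Files.write creates the file or overwrites an existing one.
        return Files.write(targetDir.resolve(name), data);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("msg-attachments");
        Path saved = save(dir, "report.txt", new byte[] {1, 2, 3});
        System.out.println("Saved " + Files.size(saved) + " bytes to " + saved);
    }
}
```

Inside the attachment loop this could be called as `AttachmentSaver.save(dir, file.getFilename(), file.getData())`. Two attachments with the same name would overwrite each other in this sketch, and characters that are invalid on a particular file system are not filtered.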