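I use Tika to automatically detect the content type of documents pushed into a DMS. Almost everything works well, except for emails.
I have to distinguish between standard mail messages (mime => message/rfc822) and signed mail messages (mime => multipart/signed), but all emails are detected as message/rfc822.
The signed mail that is not detected correctly has the following Content-Type header: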
Content-Type: multipart/signed; protocol="application/x-pkcs7-signature"; micalg=sha1; boundary="----4898E6D8BDE1929CA602BE94D115EF4C"
The Java code I use for detection is:
Detector detector;
List<Detector> detectors = new ArrayList<Detector>();
detectors.add(new ZipContainerDetector());
detectors.add(new POIFSContainerDetector());
detectors.add(MimeTypes.getDefaultMimeTypes());
detector = new CompositeDetector(detectors);
String mimetype = detector.detect(TikaInputStream.get(new File(args[0])), new Metadata()).toString();
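I reference the core library and tika-parser to detect PDF and MS Word documents. Am I missing something else?
Answer 0: (score: 1)
I solved my problem by writing a custom detector that implements the Detector interface: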
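import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;
import javax.mail.Session;
import javax.mail.internet.MimeMessage;
import org.apache.tika.detect.Detector;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TemporaryResources;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public class MultipartSignedDetector implements Detector {
    @Override
    public MediaType detect(InputStream is, Metadata metadata) throws IOException {
        // The Detector contract allows a null stream
        if (is == null) {
            return MediaType.OCTET_STREAM;
        }
        TemporaryResources tmp = new TemporaryResources();
        TikaInputStream tis = TikaInputStream.get(is, tmp);
        // Mark the stream so it can be rewound for the next detector
        tis.mark(Integer.MAX_VALUE);
        try {
            // Parsing needs no mail server, so an empty Properties object is enough
            Session session = Session.getInstance(new Properties());
            MimeMessage mimeMessage = new MimeMessage(session, tis);
            String contentType = mimeMessage.getContentType();
            if (contentType != null && mimeMessage.getMessageID() != null
                    && contentType.toLowerCase().contains("multipart/signed")) {
                return new MediaType("multipart", "signed");
            }
            return MediaType.OCTET_STREAM;
        } catch (Exception e) {
            // Not parseable as a MIME message: leave the decision to the other detectors
            return MediaType.OCTET_STREAM;
        } finally {
            try {
                tis.reset();
                tmp.dispose();
            } catch (TikaException e) {
                // ignore
            }
        }
    }
}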
Then add the custom detector to the composite detector, before the default detector:
Detector detector;
List<Detector> detectors = new ArrayList<Detector>();
detectors.add(new ZipContainerDetector());
detectors.add(new POIFSContainerDetector());
detectors.add(new MultipartSignedDetector());
detectors.add(MimeTypes.getDefaultMimeTypes());
detector = new CompositeDetector(detectors);
String mimetype = detector.detect(TikaInputStream.get(new File(args[0])), new Metadata()).toString();
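The check the custom detector performs comes down to reading the top-level Content-Type header. As a rough, dependency-free sketch of that idea (not the code from the answer, and not part of Tika or JavaMail), the header block can be scanned directly. The class name SignedMailSniffer and its methods are hypothetical, and the sketch assumes the file starts with the message headers and uses standard header folding:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Tika or JavaMail.
final class SignedMailSniffer {

    private static final String HEADER = "Content-Type:";

    private SignedMailSniffer() {}

    /**
     * Returns the unfolded value of the top-level Content-Type header,
     * or null if the header block ends before one is found.
     */
    static String topLevelContentType(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.ISO_8859_1));
        StringBuilder value = null;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.isEmpty()) {
                break; // blank line: end of the top-level header block
            }
            char first = line.charAt(0);
            if (first == ' ' || first == '\t') {
                // Folded continuation line: append only if it belongs to Content-Type
                if (value != null) {
                    value.append(' ').append(line.trim());
                }
                continue;
            }
            if (value != null) {
                break; // a new header started, so Content-Type is complete
            }
            if (line.regionMatches(true, 0, HEADER, 0, HEADER.length())) {
                value = new StringBuilder(line.substring(HEADER.length()).trim());
            }
        }
        return value == null ? null : value.toString();
    }

    /** True if the top-level Content-Type is multipart/signed. */
    static boolean isMultipartSigned(InputStream in) throws IOException {
        String contentType = topLevelContentType(in);
        return contentType != null
                && contentType.toLowerCase(Locale.ROOT).startsWith("multipart/signed");
    }
}
```

Inside a Detector, such a helper would presumably still need the mark/reset handling shown in the answer, because the BufferedReader reads ahead of the header block. Unlike the JavaMail-based version, it does not verify that the rest of the file is a well-formed message.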