PropertyTypeException in Apache Tika metadata

Asked: 2017-07-26 09:41:56

Tags: java apache-tika crawler4j

I use Crawler4j to fetch pages and PDF files. I have verified that the byte array I receive is valid and can be written out as a PDF file.
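For reference, a cheap way to sanity-check such a byte array before handing it to a parser is to look for the `%PDF-` marker at its start. The sketch below is only an illustration: the class name `PdfHeaderCheck` and the method `looksLikePdf` are hypothetical, and the check assumes the marker sits at offset 0. It does not prove that the file is well formed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Crawler4j or Tika.
final class PdfHeaderCheck {
    // Every PDF file is expected to begin with the ASCII marker "%PDF-".
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the data starts with the PDF marker at offset 0.
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample inputs, for illustration only.
        byte[] pdfLike = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] htmlLike = "<html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(pdfLike));
        System.out.println(looksLikePdf(htmlLike));
    }
}
```

In the crawler this would be called on the result of `page.getContentData()` before parsing, so that HTML error pages served under a `.pdf` URL are skipped early.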

With this byte array, I do the following:

//Tika specific types
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
InputStream inputstream;
ParseContext pcontext = new ParseContext();
PDFParser pdfparser = new PDFParser();

...

byte[] contentData = page.getContentData(); //Crawler4j content, delivers valid PDF
//Path path = Paths.get("C:\\Test\\local.pdf"); //use this line to read from a local pdf

//Default fields:
String title = "pdf title";
String content = "";
String suggestions = "";
//
try {
    ////contentData = Files.readAllBytes(path); //use this line to read from a local pdf
    inputstream = new ByteArrayInputStream(contentData);
    pdfparser.parse(inputstream, handler, metadata, pcontext); //THIS LINE CRASHES
    content = "pdf suggestions";
    suggestions = handler.toString();
} catch (Exception e) {
    LOGGER.warn("Error parsing with Tika.", e);
}

I have marked the line that fails. It throws the following exception:

WARN 2017-07-26 11:17:51,302 [Thread-5] de.searchadapter.crawler.solrparser.parser.file.PDFFileParser - Error parsing with Tika.
org.apache.tika.metadata.PropertyTypeException: xmpMM:DocumentID : SIMPLE
    at org.apache.tika.metadata.Metadata.add(Metadata.java:305)
    at org.apache.tika.parser.image.xmp.JempboxExtractor.addMetadata(JempboxExtractor.java:209)
    at org.apache.tika.parser.image.xmp.JempboxExtractor.extractXMPMM(JempboxExtractor.java:150)
    at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:239)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:154)
    at de.searchadapter.crawler.solrparser.parser.file.PDFFileParser.parse(PDFFileParser.java:82)
    at de.searchadapter.crawler.solrparser.SolrParser.parse(SolrParser.java:36)
    at de.searchadapter.crawler.SolrJAdapter.indexDocs(SolrJAdapter.java:58)
    at de.searchadapter.crawler.WebCrawler.onBeforeExit(WebCrawler.java:63)
    at edu.uci.ics.crawler4j.crawler.CrawlController$1.run(CrawlController.java:309)
    at java.lang.Thread.run(Thread.java:745)

The code above is from PDFFileParser. I do not set any properties myself, so I am confused about where this error comes from.
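A note on reading the trace: the exception is raised inside `Metadata.add`, which is reached from Tika's own XMP extraction (`JempboxExtractor.extractXMPMM`) and not from the calling code. One possible reading, which is an assumption and not confirmed here, is that a property allowed to hold only a single value (`SIMPLE`) is being given a second one. The toy model below illustrates that kind of failure only. It is not Tika's code, and the class `SingleValuedMetadataSketch` and the ID strings are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a metadata store whose properties are all single-valued.
// This is an illustration only and does not reproduce Tika's Metadata class.
final class SingleValuedMetadataSketch {
    private final Map<String, String> values = new HashMap<>();

    // Adds a value; refuses a second value for the same property name.
    void add(String name, String value) {
        if (values.containsKey(name)) {
            throw new IllegalStateException(name + " : SIMPLE");
        }
        values.put(name, value);
    }

    // Overwrites any previous value instead of rejecting it.
    void set(String name, String value) {
        values.put(name, value);
    }

    String get(String name) {
        return values.get(name);
    }

    public static void main(String[] args) {
        SingleValuedMetadataSketch metadata = new SingleValuedMetadataSketch();
        metadata.add("xmpMM:DocumentID", "first-id");      // hypothetical value
        try {
            metadata.add("xmpMM:DocumentID", "second-id"); // second add is rejected
        } catch (IllegalStateException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

If this reading is right, the failure would depend on the XMP metadata embedded in the particular PDF, which would also be consistent with other files parsing without error.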

Additional information: the PDF file seems to use an unknown font, and the following warning appears:

11:17:50.963 [Thread-5] WARN o.a.pdfbox.pdmodel.font.PDSimpleFont - No Unicode mapping for f_i (30) in font GGOLOE+TheSansC5-Plain

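Edit: I changed the code so that it can also read a local PDF file. I tried a different PDF file and did not get the error, so the problem seems to be caused by the font that fails to map.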

0 answers