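I am using Crawler4j to fetch pages and PDF files. I have already verified that the byte array I get is valid and can be written out to a PDF file.
With this byte array, I do the following: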
//Tika specific types
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
InputStream inputstream;
ParseContext pcontext = new ParseContext();
PDFParser pdfparser = new PDFParser();
...
byte[] contentData = null;
contentData = page.getContentData(); //Crawler4j content, delivers valid PDF
//Path path = Paths.get("C:\\Test\\local.pdf"); //use this line to read from a local pdf
//Default fields:
String title = "pdf title";
String content = "";
String suggestions = "";
//
try {
////contentData = Files.readAllBytes(path); //use this line to read from a local pdf
inputstream = new ByteArrayInputStream(contentData);
pdfparser.parse(inputstream, handler, metadata,pcontext); //THIS LINE CRASHES
content = "pdf suggestions";
suggestions = handler.toString();
} catch (Exception e) {
LOGGER.warn("Error parsing with Tika.", e);
}
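For reference, the sanity check mentioned at the top (confirming that the crawled bytes really are a PDF and dumping them to disk) could be sketched roughly as follows. This is only an illustration using the Java standard library: the class `PdfBytesCheck` and its methods `looksLikePdf` and `dumpToFile` are hypothetical names, not part of Crawler4j or Tika, and the signature test is a rough check that assumes the `%PDF-` header sits at the very start of the data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of Crawler4j or Tika.
final class PdfBytesCheck {

    // Every PDF file is expected to begin with this ASCII signature.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfBytesCheck() {
    }

    /** Returns true if the data starts with the "%PDF-" signature. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (data[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Writes the bytes to target, creating or overwriting it, and returns the path. */
    static Path dumpToFile(byte[] data, Path target) throws IOException {
        return Files.write(target, data);
    }
}
```

In the snippet above, this could be called right after `page.getContentData()`, for example `PdfBytesCheck.dumpToFile(contentData, Paths.get("C:\\Test\\local.pdf"))`, so that the exact file that fails in the crawler can be re-read through the commented-out local path.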
I have marked the line that crashes. It produces the following exception:
WARN 2017-07-26 11:17:51,302 [Thread-5] de.searchadapter.crawler.solrparser.parser.file.PDFFileParser - Error parsing with Tika.
org.apache.tika.metadata.PropertyTypeException: xmpMM:DocumentID : SIMPLE
at org.apache.tika.metadata.Metadata.add(Metadata.java:305)
at org.apache.tika.parser.image.xmp.JempboxExtractor.addMetadata(JempboxExtractor.java:209)
at org.apache.tika.parser.image.xmp.JempboxExtractor.extractXMPMM(JempboxExtractor.java:150)
at org.apache.tika.parser.pdf.PDFParser.extractMetadata(PDFParser.java:239)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:154)
at de.searchadapter.crawler.solrparser.parser.file.PDFFileParser.parse(PDFFileParser.java:82)
at de.searchadapter.crawler.solrparser.SolrParser.parse(SolrParser.java:36)
at de.searchadapter.crawler.SolrJAdapter.indexDocs(SolrJAdapter.java:58)
at de.searchadapter.crawler.WebCrawler.onBeforeExit(WebCrawler.java:63)
at edu.uci.ics.crawler4j.crawler.CrawlController$1.run(CrawlController.java:309)
at java.lang.Thread.run(Thread.java:745)
The code above is from PDFFileParser. I am not setting any properties myself, so I am confused about where this error comes from.
Additional information: the PDF file appears to use an unknown font; the following warning is logged:
11:17:50.963 [Thread-5] WARN o.a.pdfbox.pdmodel.font.PDSimpleFont - No Unicode mapping for f_i (30) in font GGOLOE+TheSansC5-Plain
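Edit: I changed the code so that it can also read a local PDF file. I tried a different PDF file and did not get the error, so the failure seems to be related to the problematic font.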