I am trying to parse PDF documents with Nutch and Tika (PDFBox 1.8.3).
I tried to parse five PDFs using the following command:
danny@Ubuntu-64:~/Nutch$ ./bin/nutch parsechecker file:///home/danny/Documents/DOC-443.pdf
The only output I get is:
fetching: file:///home/danny/Documents/DOC-443.pdf
parsing: file:///home/danny/Documents/DOC-443.pdf
contentType: application/pdf
signature: 662453bc32a42af13cb4d5844d978cfc
---------
Url
---------------
file:///home/danny/Documents/DOC-443.pdf
---------
Metadata
---------
xmpTPg:NPages : 0
Content-Type : application/pdf
My hadoop.log is:
2013-12-20 11:29:41,646 INFO parse.ParserChecker - fetching: file:///home/danny/Documents/DOC-443.pdf
2013-12-20 11:29:42,174 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2013-12-20 11:29:42,209 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it in the parse-plugins.xml file
2013-12-20 11:29:42,518 WARN pdfparser.PDFParser - Parsing Error, Skipping Object
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@2d4b2312
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:604)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1224)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1189)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:123)
at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:116)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
at java.util.concurrent.FutureTask.run(FutureTask.java:166)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:724)
2013-12-20 11:29:42,521 WARN pdfparser.XrefTrailerResolver - Did not found XRef object at specified startxref position 0
2013-12-20 11:29:42,611 INFO parse.ParserChecker - parsing: file:///home/danny/Documents/DOC-443.pdf
2013-12-20 11:29:42,611 INFO parse.ParserChecker - contentType: application/pdf
2013-12-20 11:29:42,611 INFO parse.ParserChecker - signature: 662453bc32a42af13cb4d5844d978cfc
2013-12-20 11:29:42,611 INFO parse.ParserChecker - ---------
Url
---------------
2013-12-20 11:29:42,612 INFO parse.ParserChecker - ---------
Metadata
---------
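Can anyone figure out what is wrong? I have spent the past two days trying to solve this: upgrading and downgrading PDFBox, rebuilding Nutch, and so on. Nothing seems to fix it.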
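
The two PDFBox warnings in hadoop.log (expected='endstream' actual='' and startxref position 0) would be consistent with a file that is truncated or otherwise malformed, although the log alone does not prove that. As a rough way to check this independently of Nutch, here is a minimal Python sketch that scans the raw bytes for the structural markers those warnings mention. It is not part of Nutch, Tika, or PDFBox; the function name inspect_pdf_bytes is hypothetical, and the checks are heuristics only (binary stream data can contain the same byte sequences by chance).

```python
import re
import sys


def inspect_pdf_bytes(data: bytes) -> dict:
    """Collect rough structural hints from raw PDF bytes (heuristic only)."""
    ends = data.count(b"endstream")
    # Every "endstream" also contains "stream", so subtract to count openers.
    opens = data.count(b"stream") - ends
    # The last "startxref" keyword is followed by the byte offset of the xref table.
    matches = re.findall(rb"startxref\s+(\d+)", data)
    return {
        "has_header": data.startswith(b"%PDF-"),
        "stream_openers": opens,
        "endstream_markers": ends,
        "startxref": int(matches[-1]) if matches else None,
        "has_eof_marker": b"%%EOF" in data[-1024:],
    }


# Only read a file when a path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as f:
        for key, value in inspect_pdf_bytes(f.read()).items():
            print(f"{key}: {value}")
```

Saved under a name of your choosing (check_pdf.py is only an example), it could be run as python3 check_pdf.py /home/danny/Documents/DOC-443.pdf. More stream openers than endstream markers, a missing startxref, or a missing %%EOF marker would point at the file itself rather than at the Nutch or PDFBox setup.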