Nutch 2.2.1 PDF parsing

Time: 2013-12-20 11:43:31

Tags: java parsing pdf hadoop nutch

I am trying to parse PDF documents with Nutch and Tika (PDFBox 1.8.3).

I tried 5 PDFs, parsing each one with a command like the following:

danny@Ubuntu-64:~/Nutch$ ./bin/nutch parsechecker file:///home/danny/Documents/DOC-443.pdf

The only output I get is:

fetching: file:///home/danny/Documents/DOC-443.pdf
parsing: file:///home/danny/Documents/DOC-443.pdf
contentType: application/pdf
signature: 662453bc32a42af13cb4d5844d978cfc
---------
Url
---------------
file:///home/danny/Documents/DOC-443.pdf
---------
Metadata
---------
xmpTPg:NPages :     0
Content-Type :  application/pdf

My hadoop.log shows:

2013-12-20 11:29:41,646 INFO  parse.ParserChecker - fetching: file:///home/danny/Documents/DOC-443.pdf
2013-12-20 11:29:42,174 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2013-12-20 11:29:42,209 INFO  parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and all claim to support the content type application/pdf, but they are not mapped to it  in the parse-plugins.xml file
2013-12-20 11:29:42,518 WARN  pdfparser.PDFParser - Parsing Error, Skipping Object
java.io.IOException: expected='endstream' actual='' org.apache.pdfbox.io.PushBackInputStream@2d4b2312
    at org.apache.pdfbox.pdfparser.BaseParser.parseCOSStream(BaseParser.java:604)
    at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:605)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:194)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1224)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1189)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:123)
    at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:116)
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:36)
    at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:23)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
2013-12-20 11:29:42,521 WARN  pdfparser.XrefTrailerResolver - Did not found XRef object at specified startxref position 0
2013-12-20 11:29:42,611 INFO  parse.ParserChecker - parsing: file:///home/danny/Documents/DOC-443.pdf
2013-12-20 11:29:42,611 INFO  parse.ParserChecker - contentType: application/pdf
2013-12-20 11:29:42,611 INFO  parse.ParserChecker - signature: 662453bc32a42af13cb4d5844d978cfc
2013-12-20 11:29:42,611 INFO  parse.ParserChecker - ---------
Url
---------------
2013-12-20 11:29:42,612 INFO  parse.ParserChecker - ---------
Metadata
---------

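Can anyone figure out what is wrong? I have spent the past two days trying to solve this: upgrading and downgrading PDFBox, rebuilding Nutch, and so on. Nothing seems to fix it.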
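
One check that may help narrow this down: the `expected='endstream' actual=''` error and the `startxref position 0` warning are both consistent with PDFBox receiving a file whose tail is missing. A possible cause, which is an assumption and not something the log confirms, is that Nutch cut the content at the size set by its `file.content.limit` property (believed to default to 65536 bytes). The sketch below is not part of Nutch; the class name `PdfTruncationCheck`, its helper methods, and the 65536 cutoff are all hypothetical choices for illustration. It reads the PDF directly from disk and reports whether the `%%EOF` trailer marker survives a cut at that size.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical stand-alone diagnostic; it does not call Nutch, Tika, or PDFBox.
class PdfTruncationCheck {

    // Assumption: the cutoff to test against. Change it to match the
    // file.content.limit value in your own Nutch configuration.
    static final int ASSUMED_CONTENT_LIMIT = 65536;

    /** Returns true if "%%EOF" appears within the last 1024 bytes of the data. */
    static boolean hasEofMarker(byte[] data) {
        int from = Math.max(0, data.length - 1024);
        String tail = new String(data, from, data.length - from,
                StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    /** Keeps at most 'limit' bytes; a negative limit means no truncation. */
    static byte[] truncate(byte[] data, int limit) {
        if (limit < 0 || data.length <= limit) {
            return data;
        }
        return Arrays.copyOf(data, limit);
    }

    public static void main(String[] args) throws IOException {
        // Default path taken from the question; pass another path as args[0].
        Path pdf = Paths.get(args.length > 0
                ? args[0] : "/home/danny/Documents/DOC-443.pdf");
        byte[] full = Files.readAllBytes(pdf);
        byte[] cut = truncate(full, ASSUMED_CONTENT_LIMIT);

        System.out.println("size on disk: " + full.length + " bytes");
        System.out.println("%%EOF in full file: " + hasEofMarker(full));
        System.out.println("%%EOF after cut at " + ASSUMED_CONTENT_LIMIT
                + " bytes: " + hasEofMarker(cut));
    }
}
```

Compile and run it with `javac PdfTruncationCheck.java && java PdfTruncationCheck /home/danny/Documents/DOC-443.pdf`. If the full file has the marker but the cut copy does not, raising `file.content.limit` in `nutch-site.xml` (or setting it to a negative value to disable truncation) would be the next thing to try. If the full file itself lacks the marker, the PDF is more likely damaged on disk.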

0 answers:

No answers yet.