Unable to parse a PDF with Tika 1.3 (+ Lucene 4.2)

Time: 2013-05-07 17:19:16

Tags: parsing lucene apache-tika pdf-parsing

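I am trying to parse a PDF file and extract both its metadata and its text, but I am still not getting the result I want. I am sure it is a silly mistake, but I cannot see it. The file d.pdf exists and sits in the project's root folder. The imports are correct as well.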

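import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class MultiParse {
      public static void main(final String[] args) throws IOException,
                  SAXException, TikaException {
            Parser parser = new AutoDetectParser();
            File f = new File("d.pdf"); // resolved against the working directory
            System.out.println("------------ Parsing a PDF:");
            extractFromFile(parser, f);
      }

      private static void extractFromFile(final Parser parser,
                  final File f) throws IOException, SAXException,
                  TikaException {
            // Collects the extracted body text, capped at 10,000,000 characters
            BodyContentHandler handler = new BodyContentHandler(10000000);
            Metadata metadata = new Metadata();
            // try-with-resources closes the stream even if parsing fails
            try (InputStream is = TikaInputStream.get(f)) {
                  parser.parse(is, handler, metadata, new ParseContext());
            }
            // Print every metadata field the parser filled in
            for (String name : metadata.names()) {
                  System.out.println(name + ":\t" + metadata.get(name));
            }
      }
}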

Output: no errors, but... not much else either :(

------------ Parsing a PDF:
Content-Type:   application/pdf

0 answers:

No answers yet.