How can I use TikaCLI functionality from Java source code?

Asked: 2016-04-12 13:51:24

Tags: java eclipse apache apache-tika xslf

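I am trying to extract embedded files from Office documents with Apache Tika. Using the Tika CLI (cmd), everything works fine. But I have to integrate it into my Java source code in Eclipse.

So this is what I did: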

public static void saveEmbedds(String inputfile, String outputfile) throws Exception{
    try{
        // -z extracts embedded files; the target directory is the output file name without its extension
        String extractDir = removeExtension(outputfile);
        String[] arguments = new String[]{"-z", "--extract-dir=" + extractDir, inputfile};
        System.out.println("Using TIKA CLI to detect embedded Files. Target Directory: "+ extractDir);
        TikaCLI.main(arguments);
    }
    catch(Exception e){
        // Note: this catches Exceptions only, not Errors such as NoSuchMethodError
        logger.info("Exception in saveEmbedds, during search in File: " + inputfile + "\r\nDetails: " + e);
    }

}

This works for every file type except .pptx. When inputfile is a .pptx file, it produces a lot of errors. The same file works fine from cmd.

12.04.2016 15:31:33 945     Exception in thread "main" java.lang.NoSuchMethodError: org.apache.poi.xslf.usermodel.XSLFTextShape.getTextType()Lorg/apache/poi/xslf/usermodel/Placeholder; 
12.04.2016 15:31:33 945     at org.apache.tika.parser.microsoft.ooxml.XSLFPowerPointExtractorDecorator.extractContent(XSLFPowerPointExtractorDecorator.java:154) 
12.04.2016 15:31:33 945     at org.apache.tika.parser.microsoft.ooxml.XSLFPowerPointExtractorDecorator.buildXHTML(XSLFPowerPointExtractorDecorator.java:88) 
12.04.2016 15:31:33 945     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110) 
12.04.2016 15:31:33 945     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112) 
12.04.2016 15:31:33 945     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87) 
12.04.2016 15:31:33 945     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) 
12.04.2016 15:31:33 945     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280) 
12.04.2016 15:31:33 945     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120) 
12.04.2016 15:31:33 945     at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190) 
12.04.2016 15:31:33 945     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491) 
12.04.2016 15:31:33 945     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144) 
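
A NoSuchMethodError generally means that the class loaded at runtime does not contain the method the caller was compiled against. One possible cause here, and this is only an assumption, is that the Eclipse build path contains a different POI version than the one Tika expects. A minimal sketch to check which jar a class is actually loaded from and whether a public no-argument method exists on it (the class name `ClassOriginCheck` and its helper methods are hypothetical, made up for illustration):

```java
import java.security.CodeSource;

public class ClassOriginCheck {

    /** Returns the jar or directory a class was loaded from, or a placeholder if none is recorded. */
    static String originOf(Class<?> type) {
        CodeSource source = type.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "unknown (bootstrap class or no code source)";
        }
        return source.getLocation().toString();
    }

    /** Returns the return type of a public no-arg method, or null if no such method is visible. */
    static String returnTypeOf(Class<?> type, String methodName) {
        try {
            return type.getMethod(methodName).getReturnType().getName();
        } catch (NoSuchMethodException e) {
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // Defaults are taken from the stack trace above; pass other names as arguments.
        String className = args.length > 0 ? args[0] : "org.apache.poi.xslf.usermodel.XSLFTextShape";
        String methodName = args.length > 1 ? args[1] : "getTextType";

        Class<?> type = Class.forName(className);
        System.out.println("Loaded from: " + originOf(type));
        System.out.println(methodName + "() returns: " + returnTypeOf(type, methodName));
    }
}
```

Running this with the same classpath as the failing Eclipse launch configuration shows which jar supplies XSLFTextShape. If the reported return type is null or is not org.apache.poi.xslf.usermodel.Placeholder, the loaded POI version does not match what the Tika parser was compiled against.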

Is there a better way to use the functionality of the Apache Tika CLI? I also tried the ExtractEmbeddedFiles example code, but it did not work with embedded .ppt files.
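
Since the CLI works from cmd, one option is to launch it as a separate process from Java, so that it runs with its own classpath instead of the Eclipse project's. The following is a sketch under assumptions: it assumes the CLI is started as `java -jar` on a Tika app jar, that `java` is on the PATH, and that the jar path passed in is correct. The class `TikaCliProcess` and its methods are hypothetical names; only the `-z` and `--extract-dir=` arguments come from the code above.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class TikaCliProcess {

    /** Builds the same argument list as the in-process call, prefixed with "java -jar <jar>". */
    static List<String> buildCommand(String tikaAppJar, String extractDir, String inputFile) {
        List<String> command = new ArrayList<>();
        command.add("java");               // assumption: java is on the PATH
        command.add("-jar");
        command.add(tikaAppJar);           // assumption: path to the Tika app jar used from cmd
        command.add("-z");
        command.add("--extract-dir=" + extractDir);
        command.add(inputFile);
        return command;
    }

    /** Runs the CLI in a child process and returns its exit code. */
    static int extractEmbedded(String tikaAppJar, String extractDir, String inputFile)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(tikaAppJar, extractDir, inputFile))
                .inheritIO()               // forward the child's stdout/stderr to this process
                .start();
        return process.waitFor();
    }
}
```

This trades startup cost (a new JVM per call) for classpath isolation, so it may be a reasonable workaround while the in-process error is being investigated, not a replacement for fixing it.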

0 answers:

No answers