Apache Tika parser throws an exception that cannot be caught

Time: 2020-03-09 10:56:24

Tags: java exception apache-tika

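I am currently developing a tool that uses the Apache Tika parser to extract content from various kinds of files. In most cases everything works, but for some files Tika produces the following exception: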

Mar 09, 2020 11:21:58 AM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$€-2]\ * "-"_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$€-2]\ * "-"_)' with c2: null
        at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:373)
        at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:287)
        at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:191)
        at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:193)
        at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:167)
        at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:343)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:901)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:873)
        at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell(FormatTrackingHSSFListener.java:143)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell(ExcelExtractor.java:673)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord(ExcelExtractor.java:447)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord(ExcelExtractor.java:340)
        at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord(FormatTrackingHSSFListener.java:92)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.processRecord(ExcelExtractor.java:666)
        at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord(HSSFRequest.java:109)
        at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:178)
        at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:135)
        at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:316)
        at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:169)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
        at attproc.processors.AttachmentProcessor.run(AttachmentProcessor.java:68)
        at attproc.Main.lambda$main$0(Main.java:89)
        at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at java.base/java.lang.Thread.run(Thread.java:834)

I am trying to catch this exception with the following code:

try {
    byte[] content = Files.readAllBytes(path);
    try {
        Metadata metadata = new Metadata();
        BodyContentHandler handler = new BodyContentHandler(-1);
        ParseContext parseContext = new ParseContext();
        parseContext.set(PDFParserConfig.class, tikaConfig.pdfConfig);

        try {
            tikaConfig.autoDetectParser.parse(new ByteArrayInputStream(content), handler, metadata, parseContext);
            text = Optional.ofNullable(handler.toString()).orElse("");
        } catch (Exception ignored) {}

    } catch (Exception ignored) {
    }

} catch (IOException ignored) {
}

"tikaConfig" is a singleton object:

public class TikaConfiguration {
    private final TikaConfig tikaConfig;
    public final PDFParserConfig pdfConfig;
    public final Parser autoDetectParser;

    private static TikaConfiguration instance;

    private TikaConfiguration() throws Exception {
        ClassLoader classLoader = getClass().getClassLoader();
        InputStream stream = classLoader.getResourceAsStream("tikaconfig.xml");
        this.tikaConfig = new TikaConfig(stream);
        this.pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(false);

        tikaConfig.getDetector();
        autoDetectParser = new AutoDetectParser(tikaConfig);
    }

    public static TikaConfiguration setConfiguration() {
        if (TikaConfiguration.instance == null) {
            try {
                TikaConfiguration.instance = new TikaConfiguration();
            } catch (Exception ignored) {}
        }

        return TikaConfiguration.instance;
    }
}

What can I do to catch this exception?

1 Answer:

Answer 0 (score: 0)

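Take a look at this somewhat old thread. What you are seeing looks very similar. It suggests that the POI library, which Tika uses to parse Excel files, is logging a warning rather than raising an error (and your log output reflects this). The warning happens to include a stack trace in its log output (I assume POI captures the trace and then passes it on to Tika).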

So your code will not catch that warning, because it is not a thrown exception.
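To illustrate the difference between a logged exception and a thrown one, here is a minimal sketch that does not use POI or Tika at all. The class `LoggedNotThrown`, the logger name, and the `formatCell` method are hypothetical stand-ins for a library that catches its own exception and only logs it.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggedNotThrown {
    // Hypothetical logger; stands in for the library's own logger.
    private static final Logger LOG = Logger.getLogger("demo.LoggedNotThrown");

    // Mimics a library method that catches its own exception,
    // logs it together with the stack trace, and then carries on.
    static String formatCell(String format) {
        try {
            throw new IllegalArgumentException("Unsupported format: " + format);
        } catch (IllegalArgumentException e) {
            // The stack trace is printed by the logger, but nothing is rethrown.
            LOG.log(Level.WARNING, "Invalid format: " + format, e);
            return "fallback";
        }
    }

    // Returns true only if an exception actually reaches the caller.
    static boolean callerSawException() {
        try {
            formatCell("[bad]");
            return false;
        } catch (Exception e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // The warning and stack trace appear on stderr,
        // yet this prints "caller saw exception: false".
        System.out.println("caller saw exception: " + callerSawException());
    }
}
```

The caller's catch block never runs, which matches the behavior described in the question: a stack trace shows up in the output even though every call is wrapped in try/catch.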

As one commenter mentioned in the JIRA:

I'm not sure whether this is a bug. This is output from the POILogger, not from, e.g., printStackTrace().

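Regardless of whether it counts as a bug, the JIRA suggests a workaround: redirect the err stream to null when running the application (an example is provided there).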
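A narrower alternative to discarding the whole err stream may be to raise the level of just the logger that emits the warning. The sketch below rests on two assumptions that are not confirmed by the thread: that the warning goes through java.util.logging (the "Mar 09, 2020 ... WARNING:" header looks like its default format), and that the logger is named after the class shown in that header, org.apache.poi.ss.format.CellFormat. The class `QuietPoiFormatWarnings` is a hypothetical helper.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietPoiFormatWarnings {
    // Assumption: logger name taken from the class shown in the log header.
    static final String ASSUMED_LOGGER_NAME = "org.apache.poi.ss.format.CellFormat";

    // Keep a strong reference: the LogManager only holds loggers weakly,
    // so a logger configured and then dropped could lose its level.
    private static final Logger POI_FORMAT_LOGGER =
            Logger.getLogger(ASSUMED_LOGGER_NAME);

    // Suppress WARNING and below for this one logger; SEVERE still gets through.
    static void silence() {
        POI_FORMAT_LOGGER.setLevel(Level.SEVERE);
    }

    // Reports whether a WARNING record would currently be logged.
    static boolean warningsEnabled() {
        return POI_FORMAT_LOGGER.isLoggable(Level.WARNING);
    }

    public static void main(String[] args) {
        silence();
        // Prints "warnings enabled: false".
        System.out.println("warnings enabled: " + warningsEnabled());
    }
}
```

Calling `QuietPoiFormatWarnings.silence()` once at startup, before any parsing, would be the intended usage. If the warning still appears, the logging assumptions above do not hold for the POI version in use, and redirecting the err stream as the JIRA suggests remains the fallback.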

I downloaded the spreadsheet attached to the JIRA and was able to reproduce their version of the message:

WARNING: Invalid format: "_([$Ç-2]\ * #,##0.00_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * #,##0.00_)' with c2: null
    at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:373)
    at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:287)
    at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:191)
    at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:193)
...

However, my program completed successfully. It went on to generate its output correctly.