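I am currently developing a tool that uses the Apache Tika parser to extract content from different kinds of files. In most cases everything works, but for some files Tika prints the following exception: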
Mar 09, 2020 11:21:58 AM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$€-2]\ * "-"_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$€-2]\ * "-"_)' with c2: null
at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:373)
at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:287)
at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:191)
at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:193)
at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:167)
at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:343)
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:901)
at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:873)
at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.formatNumberDateCell(FormatTrackingHSSFListener.java:143)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.formatNumberDateCell(ExcelExtractor.java:673)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.internalProcessRecord(ExcelExtractor.java:447)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processRecord(ExcelExtractor.java:340)
at org.apache.poi.hssf.eventusermodel.FormatTrackingHSSFListener.processRecord(FormatTrackingHSSFListener.java:92)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener$TikaFormatTrackingHSSFListener.processRecord(ExcelExtractor.java:666)
at org.apache.poi.hssf.eventusermodel.HSSFRequest.processRecord(HSSFRequest.java:109)
at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.genericProcessEvents(HSSFEventFactory.java:178)
at org.apache.poi.hssf.eventusermodel.HSSFEventFactory.processEvents(HSSFEventFactory.java:135)
at org.apache.tika.parser.microsoft.ExcelExtractor$TikaHSSFListener.processFile(ExcelExtractor.java:316)
at org.apache.tika.parser.microsoft.ExcelExtractor.parse(ExcelExtractor.java:169)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:183)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:131)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at attproc.processors.AttachmentProcessor.run(AttachmentProcessor.java:68)
at attproc.Main.lambda$main$0(Main.java:89)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
I am trying to catch this exception with the following code:
try {
    byte[] content = Files.readAllBytes(path);
    try {
        Metadata metadata = new Metadata();
        BodyContentHandler handler = new BodyContentHandler(-1);
        ParseContext parseContext = new ParseContext();
        parseContext.set(PDFParserConfig.class, tikaConfig.pdfConfig);
        try {
            tikaConfig.autoDetectParser.parse(new ByteArrayInputStream(content), handler, metadata, parseContext);
            text = Optional.ofNullable(handler.toString()).orElse("");
        } catch (Exception ignored) {}
    } catch (Exception ignored) {
    }
} catch (IOException ignored) {
}
"tikaConfig" is a singleton object:
public class TikaConfiguration {
    private final TikaConfig tikaConfig;
    public final PDFParserConfig pdfConfig;
    public final Parser autoDetectParser;
    private static TikaConfiguration instance;

    private TikaConfiguration() throws Exception {
        ClassLoader classLoader = getClass().getClassLoader();
        InputStream stream = classLoader.getResourceAsStream("tikaconfig.xml");
        this.tikaConfig = new TikaConfig(stream);
        this.pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(false);
        tikaConfig.getDetector();
        autoDetectParser = new AutoDetectParser(tikaConfig);
    }

    public static TikaConfiguration setConfiguration() {
        if (TikaConfiguration.instance == null) {
            try {
                TikaConfiguration.instance = new TikaConfiguration();
            } catch (Exception ignored) {}
        }
        return TikaConfiguration.instance;
    }
}
What can I do to catch this exception?
Answer 0 (score: 0)
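Take a look at this somewhat old thread. What you are seeing looks very similar. It indicates that the POI library, which Tika uses to parse Excel files, is logging a warning rather than raising an error (and your log output reflects this too). The warning happens to include a stack trace in its log output (I assume POI catches the exception and the trace is then passed on to Tika).
So your code will not catch this warning, because it is not a thrown exception.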
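To make the difference between a logged stack trace and a thrown exception concrete, here is a minimal sketch. It does not use Tika or POI at all: formatCell is a hypothetical stand-in for a library method that catches its own IllegalArgumentException and only logs it through java.util.logging. The class and method names are invented for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LoggedNotThrown {
    private static final Logger LOG = Logger.getLogger(LoggedNotThrown.class.getName());

    /** Hypothetical stand-in for a library method that handles its own failure. */
    public static String formatCell(String pattern) {
        try {
            if (pattern.contains("[")) {
                throw new IllegalArgumentException(
                        "Unsupported [] format block in '" + pattern + "'");
            }
            return "formatted";
        } catch (IllegalArgumentException e) {
            // The exception is logged with its stack trace and then swallowed,
            // so nothing propagates to the caller.
            LOG.log(Level.WARNING, "Invalid format: " + pattern, e);
            return "fallback";
        }
    }

    /** Returns true only if an exception actually reached the caller. */
    public static boolean callerSawException(String pattern) {
        try {
            formatCell(pattern);
            return false;
        } catch (Exception e) {
            return true;
        }
    }

    public static void main(String[] args) {
        // A WARNING with a stack trace appears on the console,
        // yet the catch block in callerSawException never runs.
        System.out.println(callerSawException("_([$x-2] 0_)")); // prints "false"
    }
}
```

The console shows a full stack trace, but the caller's try/catch has nothing to catch, which matches what the question describes.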
As one commenter mentions in the JIRA:
"I am not sure whether this is a bug. This is output from POILogger, not from e.g. printStackTrace()."
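Regardless of whether it counts as a bug, the JIRA suggests a workaround: redirect the err stream to null when running the application (an example is provided there).
I downloaded the spreadsheet attached to the JIRA and was able to reproduce their version of the message: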
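The JIRA example itself is not reproduced here. As a rough sketch of how the output could be suppressed from inside the program instead, the block below shows two options. The first is an alternative not mentioned in the JIRA: the header of the warning looks like java.util.logging output, so it assumes the logger is named after the class shown in that header, org.apache.poi.ss.format.CellFormat. The second discards System.err entirely, which is closer to the redirect described above. The class name QuietPoiWarnings and its method names are made up for this example.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietPoiWarnings {
    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // level set on an otherwise unreferenced logger can be lost after GC.
    // The logger name is an assumption taken from the log header in the question.
    private static final Logger CELL_FORMAT_LOGGER =
            Logger.getLogger("org.apache.poi.ss.format.CellFormat");

    /** Option 1: silence only the logger named in the warning. */
    public static void silenceCellFormatLogger() {
        CELL_FORMAT_LOGGER.setLevel(Level.OFF);
    }

    /**
     * Option 2: discard everything written to System.err (needs Java 11+).
     * Call this early: a console log handler created before this call may
     * already hold the original stream. It also hides all other error output.
     */
    public static void discardStandardError() {
        System.setErr(new PrintStream(OutputStream.nullOutputStream()));
    }

    public static void main(String[] args) {
        silenceCellFormatLogger();
        // With the level set to OFF, WARNING records are no longer loggable.
        System.out.println(CELL_FORMAT_LOGGER.isLoggable(Level.WARNING)); // prints "false"
    }
}
```

Option 1 is narrower and leaves other error output visible; option 2 is blunt and is probably only reasonable for a batch tool whose stderr nobody reads. Either call would go at the start of the program, before the first file is parsed.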
WARNING: Invalid format: "_([$Ç-2]\ * #,##0.00_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * #,##0.00_)' with c2: null
at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:373)
at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:287)
at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:191)
at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:193)
...
However, my program completed successfully and went on to produce its output correctly.