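I am using Solr to index several thousand documents, and for the most part it works well. Problems arise when a document is malformed or contains certain special characters: Solr hangs or gets stuck on that particular document, and the logs show errors like this: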
Exception while processing: files document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 1
at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:70)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:165)
at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:515)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:417)
at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:481)
at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:462)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@2cc58e97
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:258)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:159)
... 9 more
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(Unknown Source)
at org.apache.tika.parser.microsoft.WordExtractor.handleSpecialCharacterRuns(WordExtractor.java:407)
at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:256)
at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:196)
at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:105)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:201)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:172)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
... 12 more
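I would like to detect which files are causing these problems, or at least be pointed to any libraries I might be missing. Thanks in advance.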
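Since the log only says "Processing Document # 1" and never names the file, one way to narrow this down might be to run each file through the extraction step on its own, outside the import, and record the ones that throw. Below is a minimal sketch of that idea, not a definitive implementation. The names `find_failing_files` and `demo_parse` are hypothetical, and `demo_parse` is only a stand-in: in practice it would be replaced by whatever call actually invokes the parser (for example a Tika client), which is an assumption not shown here.

```python
from pathlib import Path
from typing import Callable, Dict, Union


def find_failing_files(
    root: Union[str, Path], parse: Callable[[Path], object]
) -> Dict[Path, str]:
    """Run `parse` on every regular file under `root`.

    Returns a mapping of path -> error description for files where
    `parse` raised an exception. Files that parse cleanly are omitted.
    """
    failures: Dict[Path, str] = {}
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            parse(path)
        except Exception as exc:
            # A buggy parser can raise almost anything (the trace above
            # ends in an index-out-of-range error), so catch broadly and
            # keep going with the next file.
            failures[path] = f"{type(exc).__name__}: {exc}"
    return failures


def demo_parse(path: Path) -> str:
    # Hypothetical stand-in for the real extraction call. It simply
    # decodes the file as UTF-8 and raises UnicodeDecodeError otherwise.
    return path.read_bytes().decode("utf-8")


if __name__ == "__main__":
    import sys

    # Usage: python find_failing.py /path/to/documents
    for bad_path, error in find_failing_files(sys.argv[1], demo_parse).items():
        print(f"{bad_path}\t{error}")
```

Because each file is handled in its own try/except, one bad document does not stop the scan, and the resulting list shows which files to inspect or exclude before re-running the import.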