SOLR 7 error when parsing Word documents

Time: 2018-08-13 14:15:05

Tags: java solrcloud full-text-indexing

I am using SOLR 7 for full-text extraction of Windows .doc files. I get this error:

o.a.s.h.RequestHandlerBase org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x0A1A0A0D474E5089, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:144)
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:113)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:301)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:124)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.extractor.EmbeddedDocumentUtil.parseEmbedded(EmbeddedDocumentUtil.java:220)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:124)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:100)
    at org.apache.tika.parser.microsoft.WordExtractor.handlePictureCharacterRun(WordExtractor.java:640)
    at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:372)
    at org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:259)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:182)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:176)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:132)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2539)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:709)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:515)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:169)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:531)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:760)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:678)
    at java.lang.Thread.run(Thread.java:748)

2018-08-13 13:50:58.445 ERROR (qtp1671846437-23) [c:bdl s:shard1 r:core_node3 x:bdl_shard1_replica_n1] o.a.s.s.HttpSolrCall null:org.apache.poi.poifs.filesystem.NotOLE2FileException: Invalid header signature; read 0x0A1A0A0D474E5089, expected 0xE11AB1A1E011CFD0 - Your file appears not to be a valid OLE2 document
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:144)
    at org.apache.poi.poifs.storage.HeaderBlock.<init>(HeaderBlock.java:113)
    at org.apache.poi.poifs.filesystem.NPOIFSFileSystem.<init>(NPOIFSFileSystem.java:301)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:124)
    at org.apache.tika.parser.DelegatingParser.parse(DelegatingParser.java:72)
    at org.apache.tika.extractor.ParsingEmbeddedDocumentExtractor.parseEmbedded(ParsingEmbeddedDocumentExtractor.java:102)
    at org.apache.tika.extractor.EmbeddedDocumentUtil.parseEmbedded(EmbeddedDocumentUtil.java:220)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:124)
    at org.apache.tika.parser.microsoft.AbstractPOIFSExtractor.handleEmbeddedResource(AbstractPOIFSExtractor.java:100)
    at org.apache.tika.parser.microsoft.WordExtractor.handlePictureCharacterRun(WordExtractor.java:640)
    at org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:372)
    at org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:259)
    at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:182)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:176)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:132)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:199)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2539)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:709)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:515)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:377)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:323)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1634)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:533)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1595)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1253)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:473)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1564)
    at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1155)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:219)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.StatisticsHandler.handle(StatisticsHandler.java:169)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:531)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:760)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:678)
    at java.lang.Thread.run(Thread.java:748)

With SOLR 5.5, however, I get no error. Any idea why?

1 answer:

Answer 0: (score: 0)

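The problem occurs when the document contains a picture. The picture's header is what gets read, while POI expects something else... The header of the document itself is fine.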

read 0x0A1A0A0D474E5089, expected 0xE11AB1A1E011CFD0

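The expected value is the header signature of a Word (OLE2) document; the value that was read is the header of something else, for example a picture. Here, 0x0A1A0A0D474E5089 read as a little-endian number is the 8-byte PNG signature. The parser always expects the same OLE2 header, even when the element it is given is of another type. Changing the parser will solve the problem.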
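To check which kind of content each signature in the error message belongs to, the two values can be decoded back into file bytes. The following is a minimal sketch using only the JDK; the class name `HeaderSignature` and its methods are hypothetical helpers written for this illustration and are not part of POI, Tika, or Solr. It assumes the 8-byte header was read as a little-endian long, which is why the bytes appear reversed in the message.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical helper for illustration only; not a POI, Tika, or Solr class.
class HeaderSignature {
    // First 8 bytes of an OLE2 compound document (legacy .doc, .xls, .ppt).
    static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // First 8 bytes of a PNG image.
    static final byte[] PNG = {
        (byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A
    };

    // Converts the long shown in the error message back into the
    // bytes in the order they appear in the file (little-endian).
    static byte[] toFileBytes(long signature) {
        return ByteBuffer.allocate(Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN)
                .putLong(signature)
                .array();
    }

    // Names the format whose magic bytes match the signature.
    static String identify(long signature) {
        byte[] bytes = toFileBytes(signature);
        if (Arrays.equals(bytes, OLE2)) return "OLE2";
        if (Arrays.equals(bytes, PNG)) return "PNG";
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println("read:     " + identify(0x0A1A0A0D474E5089L));
        System.out.println("expected: " + identify(0xE11AB1A1E011CFD0L));
    }
}
```

Running it prints `read:     PNG` and `expected: OLE2`, which is consistent with the stack trace: the failure happens under `handleEmbeddedResource`, where an embedded PNG picture is handed to the OLE2 parser.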