Sending an HTML document to Solr

Time: 2018-04-06 18:46:08

Tags: solr apache-tika

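I am sending an HTML document to Solr, and Tika is throwing a "Zip bomb detected!" exception back. The Solr log reports: "Suspected zip bomb: 100 levels of XML element nesting"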

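Looking at the Tika source code, there is an arbitrary limit of 100 levels on XML element nesting (the check lives in SecureContentHandler, which also appears in the stack trace).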

The variable in question (maxDepth) does have a public setter, but I'm not sure whether it can be set from within Solr. Is that possible?
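
For triage, one rough way to spot documents that are likely to hit the 100-level limit is to estimate their element nesting depth before sending them to Solr. The sketch below is only an approximation: it uses Python's standard `html.parser`, not the TagSoup parser that Tika uses, so the depth Tika sees after it repairs the markup may differ. The names `DepthCounter` and `max_nesting_depth` are made up for this example.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they must not add a nesting level.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "param", "source", "track", "wbr",
}


class DepthCounter(HTMLParser):
    """Tracks the stack of open elements and remembers the deepest point."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        self.open_tags.append(tag)
        self.max_depth = max(self.max_depth, len(self.open_tags))

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <x/> opens and closes in one step.
        if tag not in VOID_ELEMENTS:
            self.max_depth = max(self.max_depth, len(self.open_tags) + 1)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def max_nesting_depth(html_text):
    """Return the estimated maximum element nesting depth of an HTML string."""
    counter = DepthCounter()
    counter.feed(html_text)
    counter.close()
    return counter.max_depth
```

A document whose estimate is close to or above 100 is a plausible candidate for the exception; a lower estimate does not rule it out, because unclosed tags are handled differently by each parser.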

Here is the full stack trace:

2018-04-05 16:47:48.034 ERROR (qtp1654589030-15) [   x:aconn] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Zip bomb detected!
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:234)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:68)
    at ca.calgary.csc.wds.solr.GsaAconnRequestHandler.handleRequestBody(GsaAconnRequestHandler.java:84)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:177)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:2503)
    at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:710)
    at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:516)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:382)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:326)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1751)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:582)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:512)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:213)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.server.Server.handle(Server.java:534)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:320)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.tika.exception.TikaException: Zip bomb detected!
    at org.apache.tika.sax.SecureContentHandler.throwIfCauseOf(SecureContentHandler.java:192)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:138)
    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:228)
    ... 35 more
Caused by: org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 100 levels of XML element nesting
    at org.apache.tika.sax.SecureContentHandler.startElement(SecureContentHandler.java:234)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:255)
    at org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:297)
    at org.apache.tika.parser.html.HtmlHandler.startElementWithSafeAttributes(HtmlHandler.java:251)
    at org.apache.tika.parser.html.HtmlHandler.startElement(HtmlHandler.java:167)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.parser.html.XHTMLDowngradeHandler.startElement(XHTMLDowngradeHandler.java:60)
    at org.ccil.cowan.tagsoup.Parser.push(Parser.java:794)
    at org.ccil.cowan.tagsoup.Parser.rectify(Parser.java:1061)
    at org.ccil.cowan.tagsoup.Parser.stagc(Parser.java:1016)
    at org.ccil.cowan.tagsoup.HTMLScanner.scan(HTMLScanner.java:625)
    at org.ccil.cowan.tagsoup.Parser.parse(Parser.java:449)
    at org.apache.tika.parser.html.HtmlParser.parse(HtmlParser.java:135)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
    ... 36 more

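Edit: I found a Jira issue that appears to have a similar cause. The solution Tim Allison gives there is to use Tika's default HTML mapper instead of Solr's. How can I set that in the Solr configuration?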

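Edit2: I have verified that the problem is specific to the way Solr uses Tika, because the standalone tika-app jar extracts the file's contents successfully: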

>java -jar tika-app-1.16.jar -t test.html

1 answer:

Answer 0 (score: 0)

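According to Tim, this cannot be set through the Solr configuration. As an alternative, the advice I have seen elsewhere is to run Tika outside of Solr, i.e. not to use Solr Cell.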
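
A minimal sketch of that approach, under several assumptions: it shells out to the same standalone tika-app jar used in the question, then posts the extracted text to Solr's JSON update handler. The Solr URL, the `content` field name, and the function names are hypothetical; the core name `aconn` is taken from the log line in the question, and the field must exist in your schema.

```python
import json
import subprocess
import urllib.request


def extract_text(html_path, tika_jar="tika-app-1.16.jar"):
    """Run tika-app outside Solr (as in the question) and return plain text."""
    result = subprocess.run(
        # --encoding asks tika-app for UTF-8 output so decoding is predictable.
        ["java", "-jar", tika_jar, "--encoding=UTF-8", "-t", html_path],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def build_update_request(solr_core_url, doc_id, text):
    """Build a POST to the core's /update handler with one JSON document."""
    # "content" is an assumed field name; adjust it to match the schema.
    body = json.dumps([{"id": doc_id, "content": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_core_url.rstrip("/") + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_html(solr_core_url, doc_id, html_path):
    """Extract text with Tika, then index it without going through Solr Cell."""
    request = build_update_request(solr_core_url, doc_id, extract_text(html_path))
    with urllib.request.urlopen(request) as response:
        return response.status
```

Usage would look like `index_html("http://localhost:8983/solr/aconn", "test-1", "test.html")`. Because extraction happens in a separate process, Solr never runs its own Tika handler on the document; whether tika-app accepts a given file still depends on Tika's own limits.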