SOLR import crashes while processing a Tika document

Time: 2011-11-10 16:02:08

Tags: solr apache-tika

I'm having trouble importing into Solr with Tika: the import keeps crashing on some of my documents while indexing web pages.

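As a workaround, I delete the content of the offending Tika document and restart the import, but this is very tedious, and I obviously lose the content of those documents.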
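
One possible way to avoid restarting by hand, sketched here under assumptions: DataImportHandler entities accept an `onError` attribute (`abort`, `skip`, or `continue`; the default is `abort`), and setting it to `skip` on the Tika entity asks DIH to skip a document that fails instead of aborting the whole import. The fragment below is not the configuration used in this question. The entity names, the `baseDir` path, the file pattern, and the field mapping are hypothetical placeholders, and it assumes the pages are read from files on disk. Whether `TikaEntityProcessor` honors `onError` for parse failures may depend on the Solr version in use.

```xml
<!-- Hypothetical data-config.xml fragment: entity names, baseDir, fileName
     pattern, and field mapping are placeholders, not the original setup. -->
<dataConfig>
  <dataSource name="bin" type="BinFileDataSource" />
  <document>
    <!-- Outer entity lists the files to import (assumed location). -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/path/to/pages" fileName=".*\.html"
            rootEntity="false" dataSource="null">
      <!-- onError="skip": skip a document that fails to parse
           instead of aborting the full import (default is "abort"). -->
      <entity name="tika" processor="TikaEntityProcessor"
              dataSource="bin" url="${files.fileAbsolutePath}"
              format="text" onError="skip">
        <field column="text" name="pageText" />
      </entity>
    </entity>
  </document>
</dataConfig>
```

Note that a skipped document is still not indexed, so this only removes the need to restart the import; it does not recover the content of the documents that Tika cannot parse.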

Here is the crash log:

org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 927
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:130)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@b623d7
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
    ... 8 more
Caused by: java.lang.NullPointerException

Nov 10, 2011 10:51:29 AM org.apache.solr.common.SolrException log
SEVERE: Full Import failed:org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read content Processing Document # 927
    at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:130)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:238)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:596)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:622)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:268)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:187)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:359)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:427)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:408)
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@b623d7
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:199)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
    at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:128)
    ... 8 more
Caused by: java.lang.NullPointerException

Sample of the data that causes the crash:

pageText=pageText(1.0)={<table width="100%" height="100%" border="0" cellpadding="0" cellspacing="0" nodeIndex="3" class="ril_layoutTable">
    <tr nodeIndex="2">
        <td width="50%" rowspan="3" nodeIndex="1">&nbsp;</td>           
        <td width="1" rowspan="3" nodeIndex="4"></td>           
        <td nodeIndex="5">          
            <!-- ImageReady Slices (headergraphics.psd) -->
            <table width="780" border="0" cellpadding="0" cellspacing="0" nodeIndex="8" class="ril_layoutTable">
                <tr nodeIndex="7">
                    <td colspan="9" nodeIndex="6">                      
                        <table width="780" height="40" border="0" cellpadding="0" cellspacing="0" nodeIndex="11" class="ril_layoutTable">
                            <tr nodeIndex="10">
                                <td width="500" nodeIndex="9">&nbsp;</td>                                   
                                <td width="135" nodeIndex="12">                                     
                                    <a href="/login.html" nodeIndex="80"></a>
                                    <a href="/login.html" nodeIndex="81"></a>                           
                                </td>               
                                <td width="135" nodeIndex="13">&nbsp;</td>          
                                <td nodeIndex="14">&nbsp;</td>              
                            </tr>
                        </table>
                    </td>           
                </tr>
                <tr nodeIndex="16">
                    <td nodeIndex="15"></td>        
                    <td nodeIndex="17" childIsOnlyALink="1">
                        <a href="/index.html" nodeIndex="84"></a>
                    </td>       
                    <td nodeIndex="18" childIsOnlyALink="1">
                        <a href="/history.html" nodeIndex="86"></a>
                    </td>       
                    <td nodeIndex="19" childIsOnlyALink="1">
                        <a href="/faq.html" nodeIndex="88"></a>
                    </td>       
                    <td nodeIndex="20" childIsOnlyALink="1">
                        <a href="/prep.html" nodeIndex="90"></a>
                    </td>       
                    <td nodeIndex="21"></td>        
                    <td nodeIndex="22" childIsOnlyALink="1">
                        <a href="/exercises.html" nodeIndex="93"></a>
                    </td>       
                    <td nodeIndex="23" childIsOnlyALink="1">
                        <a href="/faq.html?contact=true" nodeIndex="95"></a>
                    </td>       
                    <td nodeIndex="24"></td>        
                </tr>
                <tr nodeIndex="26">
                    <td colspan="9" nodeIndex="25"></td>
                </tr>
            </table><!-- End ImageReady Slices -->
        </td>   
        <td width="1" rowspan="3" nodeIndex="27"></td>  
        <td width="50%" rowspan="3" nodeIndex="28">&nbsp;</td>      
    </tr>
    <tr nodeIndex="30">
        <td height="100%" valign="top" nodeIndex="29">  
            <table width="780" border="0" cellpadding="0" cellspacing="0" nodeIndex="33" class="ril_layoutTable">
                <tr nodeIndex="32">
                    <td width="534" valign="top" nodeIndex="31">        
                        <table width="534" border="0" cellpadding="0" cellspacing="0" nodeIndex="36" class="ril_layoutTable">
                            <tr nodeIndex="35">
                                <td width="534" valign="top" class="bgdown" nodeIndex="34">
                                    <table cellspacing="0" cellpadding="0" nodeIndex="39" class="ril_layoutTable">
                                        <tr nodeIndex="38">
                                            <td valign="top" width="508" nodeIndex="37">                                                                    
                                                <!--Begin Content-->
                                                <h2 nodeIndex="40">Welcome to IQTest.com, home of the original  online IQ test.</h2>
                                                <p nodeIndex="41" childIsOnlyALink="1">
                                                    <a href="/prep.html" nodeIndex="100">Click here</a> to take our free, private, and fun IQ test.</p>
                                                <p nodeIndex="42">
                                                    Our original IQ test  is the most scientifically valid IQ test available on 
                                                    the web today. Previously offered only to corporations, schools, and in certified professional applications, it is now available to you. In addition to measuring your general IQ, our exclusive  test  assesses your performance in 13 different areas of intelligence, revealing your key cognizant 
                                                    strengths and weaknesses.</p>
                                                <p nodeIndex="43">
                                                    Developed by PhDs and statistically sound, our  test  reflects the best research available.<br nodeIndex="101">
                                                        <a href="/prep.html" nodeIndex="102">Click here to begin</a>
                                                        <br nodeIndex="103">
                                                            <br nodeIndex="104">
                                                </p>
                                                <h2 nodeIndex="44">
                                                    <a href="/prep.html" nodeIndex="105">IQTest.com<br nodeIndex="106">
                                                            Take the Test</a>
                                                </h2>
                                                <br nodeIndex="107">
                                                    <h2 nodeIndex="45">
                                                        <strong nodeIndex="108">What is an IQ?
                                                        </strong>
                                                    </h2>
                                                    <p nodeIndex="46">An Intelligence Quotient  indicates a person's mental abilities relative to others of approximately the same age. Everyone has hundreds of specific mental 
                                                        abilities--some  can be measured accurately and are reliable predictors of  academic and financial success.</p>
                                                    <p nodeIndex="47">Read more about <a href="whatisaniqscore.html" nodeIndex="109">Intelligence Testing</a></p>
                                                    <!-- End of StatCounter Code -->
                                                    <!--End Content-->
                                                    <br nodeIndex="113">
                                                        <p nodeIndex="48"></p>               
                                            </td>
                                        </tr>
                                    </table><!-- </div> -->
                                </td>
                            </tr>
                            <tr nodeIndex="50">
                                <td nodeIndex="49"></td>
                            </tr>
                        </table>
                    </td>   
                    <!--Begin Sidebar-->
                    <td height="100%" nodeIndex="51">&nbsp;</td>
                    <td width="225" valign="top" nodeIndex="52">
                        <table class="ril_layoutTable" width="225" border="0" cellpadding="0" cellspacing="0" nodeIndex="55">
                            <tr nodeIndex="54">
                                <td nodeIndex="53"></td>
                            </tr>
                            <tr nodeIndex="57">
                                <td width="225" valign="top" nodeIndex="56">            
                                    <h4 nodeIndex="118">What does my score mean?</h4>               
                                    <p nodeIndex="58">Please <a href="whatisaniqscore.html" nodeIndex="119">click here</a> for an explanation of IQ testing and standard deviation.<br nodeIndex="120">
                                            Please <a href="faq.html#chart" nodeIndex="121">click here</a> for a test score comparison chart.<br nodeIndex="122">
                                                Please <a href="history.html" nodeIndex="123">click here</a> for a history of intelligence testing.</p>
                                    <div align="center" margin="0" nodeIndex="59">
                                    </div>
                                </td>               
                            </tr>
                            <tr nodeIndex="61">
                                <td nodeIndex="60"></td>
                            </tr>
                            <tr nodeIndex="63">
                                <td width="225" valign="top" nodeIndex="62">            
                                    <h4 nodeIndex="127">What is the Complete Personal Intelligence Profile?</h4>                
                                    <p nodeIndex="64">Your Complete Personal Intelligence Profile will give you much greater detail about the range and variety of your mental abilities. <a href="profileexplain.html" nodeIndex="128">Read More...</a></p>                    
                                </td>               
                            </tr>
                            <tr nodeIndex="66">
                                <td nodeIndex="65"></td>
                            </tr>
                            <tr nodeIndex="68">
                                <td width="225" valign="top" nodeIndex="67">    
                                    <h4 nodeIndex="130">Consciousness Exercises</h4>    
                                    <p nodeIndex="69">The Consciousness Exercises are a set of entertaining psycho-spiritual games, puzzles, dialogs, and more, which can expand your awareness. <a href="exercises.html" nodeIndex="131">Read More...</a></p>                      
                                </td>
                            </tr>
                            <tr nodeIndex="71">
                                <td nodeIndex="70"></td>
                            </tr>
                        </table>
                    </td>       
                    <!--End Sidebar-->
                </tr>
            </table>
        </td>   
    </tr>
    <tr nodeIndex="73">
        <td nodeIndex="72">
            <table width="780" border="0" cellpadding="0" cellspacing="0" nodeIndex="76" class="ril_layoutTable">
                <tr nodeIndex="75">
                    <td width="780" height="33" align="center" nodeIndex="74">
                        <a href="/index.html" nodeIndex="133">Home</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                        <a href="/history.html" nodeIndex="134">History</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                        <a href="/faq.html" nodeIndex="135">FAQ</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                        <a href="/prep.html" nodeIndex="136">Test</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                        <a href="/exercises.html" nodeIndex="137">Consciousness Exercises</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                        <a href="/faq.html?contact=true" nodeIndex="138">Contact Us</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                        <a href="/privacy.html" nodeIndex="139">Privacy Policy</a>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
                        <a href="/remove.html" nodeIndex="140">Unsubscribe</a>
                    </td>
                </tr>
                <tr nodeIndex="78">
                    <td width="780" height="34" align="center" nodeIndex="77">&copy; 2003 -2011 Autumn Group. All rights reserved</td>
                </tr>
            </table>
        </td>   
    </tr>

0 answers:

No answers