Stop JTidy parsing once an element is found

Time: 2014-07-12 07:50:02

Tags: java xml xml-parsing jtidy

Is there a way to download and parse an XML document only up to the point where an element matching an XPathExpression is found? I am using Java:

    url = new URL("http://registroapps.uniandes.edu.co/scripts/adm_con_horario1_joomla.php?depto=" + params[0]);
    Tidy tidy = new Tidy();
    tidy.setQuiet(true);
    tidy.setXHTML(true);
    tidy.setShowWarnings(false);
    Document doc = tidy.parseDOM(url.openStream(), System.out);

    // Use XPath to obtain whatever you want from the (X)HTML
    XPath xpath = XPathFactory.newInstance().newXPath();
    XPathExpression expr = xpath.compile("//tr[td[normalize-space(font) = '" + params[1] + "']]/td/font/text()");
    NodeList result = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);

I am extracting text from an HTML document like this:

<table width="575" border="0" cellspacing="1" cellpadding="0">
                <tr> 
                  <td width="39" class="back1"><b class="texto4">CRN</b></td>
                  <td width="60" class="back1"><b class="texto4">Materia</b></td>
                  <td width="53" class="back1"><b class="texto4">Secci&oacute;n</b></td>
                  <td width="55" class="back1"><b class="texto4">Cr&eacute;ditos</b></td>
                  <td width="156" class="back1"><b class="texto4">T&iacute;tulo</b></td>
                  <td width="69" class="back1"><b class="texto4">Cupo</b></td>
                  <td width="57" class="back1"><b class="texto4">Inscritos</b></td>
                  <td width="77" class="back1"><b class="texto4">Disponible</b></td>
                </tr>
                <tr> 
                  <td width="39"><font class="texto4"> 
                    10110                        </font></td>
                  <td width="60"><font class="texto4"> 
                    IIND1000                        </font></td>
                  <td width="53"><font class="texto4"> 
                  <div align="center">
                    1                        </div></font></td>
                  <td width="55"><font class="texto4"> 
                    <div align="center">
                    3                       </div>
                    </font></td>
                  <td width="156"><font class="texto4"> 
                    INTROD. INGEN. INDUSTRIAL                        </font></td>
                  <td width="69"><font class="texto4"> 
                    100                        </font></td>
                  <td width="57"><font class="texto4"> 
                    100                        </font></td>
                  <td width="77"><font class="texto4"> 
                    0                        </font></td>
                </tr>
              </table>
<tr> 
            <td> 
              <table width="550" border="0" cellspacing="1" cellpadding="0">
                <tr> 
                  <td width="81" >&nbsp;</td>
                  <td width="172" class="back3" height="17"><b class="texto4">D&iacute;as</b></td>
                  <td width="171" class="back3" height="17"><b class="texto4">Horas</b></td>
                  <td width="171" class="back3" height="17"><b class="texto4">Sal&oacute;n</b></td>
                  <td width="171" class="back3"><b class="texto4">F. Inicial</b></td>
                  <td width="171" class="back3"><b class="texto4">F. Final</b></td>
                </tr>
                                    <tr> 
                  <td width="81" >&nbsp;</td>
                  <td width="172" height="17"><font class="texto4"> 
                        I                                </font></td>
                  <td width="171" height="17"><font class="texto4" > 
                    0700 - 0820                        </font></td>
                  <td width="171" height="17"><font class="texto4"> 
                    - -                        </font></td>
                  <td width="171"><font class="texto4" >28-JUL-14</font></td>
                  <td width="171"><font class="texto4" >15-NOV-14</font></td>
                </tr>
                                    <tr> 
                  <td width="81" ><div align="right"><span class="back3"><font class="texto4"><strong>Instructor(es)</strong>:</font></span></div></td>
                  <td width="172"  class="back3" height="17"><font class="texto4"><font class="texto4"> 
                    ALDANA VALDES EDUARDO                         </font></font></td>
                  <td width="171"  class="back3" height="17"><font class="texto4"> 
                                            </font></td>
                  <td width="171"  class="back3" height="17"><font class="texto4"></font></td>
                  <td width="171"  class="back3">&nbsp;</td>
                  <td width="171"  class="back3">&nbsp;</td>
                </tr>
              </table>                </td>
          </tr>

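So, for example, once the XPathExpression finds the code 10110 in the first table (params[1]=10110), I need it to stop downloading the tables that follow. All I want is the text of the child nodes at the same level, that is, the other cells of the matching row. The document is usually more than 10k lines long, so downloading and parsing all of it is inefficient when the element I am searching for is at the very beginning.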
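
One possible direction, as a rough sketch rather than a tested solution: read the response line by line and stop as soon as the row containing the code has been closed, so that only that prefix needs to be parsed. The class name `EarlyStopReader` and the method `readUntilRowClosed` are hypothetical names introduced for illustration. The sketch assumes that each cell value sits on its own line, as in the sample above, and it matches the code with a simple line-based heuristic instead of XPath, so a cell in another column that happens to hold the same value would also trigger the stop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of JTidy or the JDK.
final class EarlyStopReader {

    private EarlyStopReader() {
    }

    /**
     * Reads lines from 'source' until the table row that contains 'code'
     * has been closed, and returns everything read up to that point.
     * If the code never appears, the whole input is returned.
     */
    static String readUntilRowClosed(Reader source, String code) {
        StringBuilder prefix = new StringBuilder();
        boolean codeSeen = false;
        try {
            BufferedReader reader = new BufferedReader(source);
            String line;
            while ((line = reader.readLine()) != null) {
                prefix.append(line).append('\n');

                // Heuristic: strip tags and compare the remaining text with the code.
                String text = line.replaceAll("<[^>]*>", "").trim();
                if (text.equals(code)) {
                    codeSeen = true;
                }

                // Once the code has been seen, the next </tr> ends its row.
                if (codeSeen && line.toLowerCase().contains("</tr>")) {
                    break;
                }
            }
        } catch (IOException e) {
            // Wrapped so that callers do not need a checked-exception clause.
            throw new UncheckedIOException(e);
        }
        return prefix.toString();
    }
}
```

With the code from the question, the reader could wrap `url.openStream()` in an `InputStreamReader` (the page's character encoding would have to be known or assumed), and the returned prefix could then be passed to `tidy.parseDOM` through a `ByteArrayInputStream` instead of the full stream. Whether JTidy repairs the truncated markup well enough for the XPath expression to still match, and how much of the response the server has already sent by the time reading stops, would both need to be checked.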

0 answers:

No answers