Is there a way to download and parse an XML document only up to the point where an XPathExpression finds the element it is looking for? I'm using Java:
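// Imports this snippet relies on (JTidy plus the JDK's DOM and XPath APIs)
import java.net.URL;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

URL url = new URL("http://registroapps.uniandes.edu.co/scripts/adm_con_horario1_joomla.php?depto="+params[0]);
Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setXHTML(true);
tidy.setShowWarnings(false);
// parseDOM reads the whole stream before returning, so the full page is downloaded here
Document doc = tidy.parseDOM(url.openStream(), System.out);
// Use XPath to obtain whatever you want from the (X)HTML
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile("//tr[td[normalize-space(font) = '"+params[1]+"']]/td/font/text()");
NodeList result = (NodeList)expr.evaluate(doc, XPathConstants.NODESET);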
I'm extracting the text from an HTML document like this:
<table width="575" border="0" cellspacing="1" cellpadding="0">
<tr>
<td width="39" class="back1"><b class="texto4">CRN</b></td>
<td width="60" class="back1"><b class="texto4">Materia</b></td>
<td width="53" class="back1"><b class="texto4">Sección</b></td>
<td width="55" class="back1"><b class="texto4">Créditos</b></td>
<td width="156" class="back1"><b class="texto4">Título</b></td>
<td width="69" class="back1"><b class="texto4">Cupo</b></td>
<td width="57" class="back1"><b class="texto4">Inscritos</b></td>
<td width="77" class="back1"><b class="texto4">Disponible</b></td>
</tr>
<tr>
<td width="39"><font class="texto4">
10110 </font></td>
<td width="60"><font class="texto4">
IIND1000 </font></td>
<td width="53"><font class="texto4">
<div align="center">
1 </div></font></td>
<td width="55"><font class="texto4">
<div align="center">
3 </div>
</font></td>
<td width="156"><font class="texto4">
INTROD. INGEN. INDUSTRIAL </font></td>
<td width="69"><font class="texto4">
100 </font></td>
<td width="57"><font class="texto4">
100 </font></td>
<td width="77"><font class="texto4">
0 </font></td>
</tr>
</table>
<tr>
<td>
<table width="550" border="0" cellspacing="1" cellpadding="0">
<tr>
<td width="81" > </td>
<td width="172" class="back3" height="17"><b class="texto4">Días</b></td>
<td width="171" class="back3" height="17"><b class="texto4">Horas</b></td>
<td width="171" class="back3" height="17"><b class="texto4">Salón</b></td>
<td width="171" class="back3"><b class="texto4">F. Inicial</b></td>
<td width="171" class="back3"><b class="texto4">F. Final</b></td>
</tr>
<tr>
<td width="81" > </td>
<td width="172" height="17"><font class="texto4">
I </font></td>
<td width="171" height="17"><font class="texto4" >
0700 - 0820 </font></td>
<td width="171" height="17"><font class="texto4">
- - </font></td>
<td width="171"><font class="texto4" >28-JUL-14</font></td>
<td width="171"><font class="texto4" >15-NOV-14</font></td>
</tr>
<tr>
<td width="81" ><div align="right"><span class="back3"><font class="texto4"><strong>Instructor(es)</strong>:</font></span></div></td>
<td width="172" class="back3" height="17"><font class="texto4"><font class="texto4">
ALDANA VALDES EDUARDO </font></font></td>
<td width="171" class="back3" height="17"><font class="texto4">
</font></td>
<td width="171" class="back3" height="17"><font class="texto4"></font></td>
<td width="171" class="back3"> </td>
<td width="171" class="back3"> </td>
</tr>
</table> </td>
</tr>
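So, for example, once the XPathExpression finds the code 10110 in the first table (params[1] = 10110), I need it not to download the next table. Instead, I only want all the text from the child nodes at the same level as the match. The documents are usually more than 10k lines long, so downloading and parsing the whole thing is wasteful when the element I'm searching for is near the very beginning.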
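To make the desired behaviour concrete, here is a rough sketch of the kind of early stop I have in mind. It is only an illustration under assumptions: it does not use JTidy or XPath at all, it relies on the markup looking like the sample above (each `</tr>` ends its own line and every data cell is wrapped in a `<font>` tag), and the names `EarlyStopRowFinder`, `findRow` and `cellTexts` are made up for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JTidy or the JDK.
final class EarlyStopRowFinder {

    // Captures what sits between <font ...> and the next </font>; DOTALL because cells span lines.
    private static final Pattern FONT_CELL =
            Pattern.compile("<font[^>]*>(.*?)</font>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /**
     * Reads the input row by row and returns the cell texts of the first row
     * that has a cell equal to {@code code}. Nothing past that row is requested
     * from the reader, apart from what BufferedReader has already buffered.
     * Returns an empty list when no row matches.
     */
    static List<String> findRow(Reader in, String code) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        StringBuilder row = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            row.append(line).append('\n');
            if (!line.toLowerCase().contains("</tr>")) {
                continue; // row not closed yet, keep accumulating
            }
            List<String> cells = cellTexts(row.toString());
            if (cells.contains(code)) {
                return cells; // early stop: the rest of the document is never read
            }
            row.setLength(0); // assumption: nothing useful follows </tr> on the same line
        }
        return new ArrayList<>();
    }

    // Extracts the whitespace-normalized text of every <font> cell in one row.
    static List<String> cellTexts(String rowHtml) {
        List<String> cells = new ArrayList<>();
        Matcher m = FONT_CELL.matcher(rowHtml);
        while (m.find()) {
            // Drop nested tags such as <div align="center"> and collapse whitespace.
            cells.add(m.group(1).replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim());
        }
        return cells;
    }

    public static void main(String[] args) throws IOException {
        String html = "<tr>\n<td><font class=\"texto4\">\n10110 </font></td>\n"
                + "<td><font class=\"texto4\">\nIIND1000 </font></td>\n</tr>\n"
                + "<tr>\n<td><font class=\"texto4\">\n99999 </font></td>\n</tr>\n";
        System.out.println(findRow(new StringReader(html), "10110")); // prints [10110, IIND1000]
    }
}
```

With the real page, the reader would presumably be something like `new InputStreamReader(url.openStream(), charset)` (the charset is not known to me), and closing that stream right after `findRow` returns should abandon the rest of the download. Because BufferedReader reads in blocks, a little more than the matching row may still be fetched. What I am really asking is whether the same effect can be had while keeping the XPathExpression.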