I am parsing this fragment of a web page:
<tr valign="middle">
<td class="inner"><span style=""><span class="" title=""></span> 2 <span class="icon ok" title="Verified"></span> </span><span class="icon cat_tv" title="Video » TV" style="bottom:-2;"></span> <a href="/VALUE.html" style="line-height:1.4em;">VALUE</a> </td>
<td width="1%" align="center" nowrap="nowrap" class="small inner" >VALUE</td>
<td width="1%" align="right" nowrap="nowrap" class="small inner" >VALUE</td>
<td width="1%" align="center" nowrap="nowrap" class="small inner" >VALUE</td>
</tr>
I have this fragment in the variable tv: HtmlElement tv = tr.get(i);
I read the tag <a href="/VALUE.html" style="line-height:1.4em;">VALUE</a> this way:
HtmlElement a = tv.getElementsByTagName("a").get(0);
object.name.value(a.getTextContent());
url = a.getAttribute("href");
object.url_detail.value(myBase + url);
How can I read only the VALUE fields of the other <td>....</td> cells?
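Answer 0 (score: 5)
I suggest using XPath, which is the recommended way to parse XML / HTML.
Reference: How to read XML using XPath in Java
Also take a look at this question: RegEx match open tags except XHTML self-contained tags
Update
If I understand correctly, you need the "VALUE" of each td, right? If so, your XPath would look something like this: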
//td[@class="small inner"]/text()
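As a minimal sketch of how that expression could be evaluated, the following uses only the JDK's built-in XPath support. It assumes the row is well-formed XML, which is true of the fragment in the question but not of HTML in general. The class name TdValueXPath and the method readValues are hypothetical names chosen for illustration, and the demo row is a shortened version of the question's fragment.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original answer.
class TdValueXPath {

    // Returns the trimmed text of every <td class="small inner"> cell.
    // Assumption: the input parses as well-formed XML.
    static List<String> readValues(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Same expression as in the answer: text nodes of the matching cells.
            NodeList nodes = (NodeList) xpath.evaluate(
                    "//td[@class=\"small inner\"]/text()",
                    doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue().trim());
            }
            return values;
        } catch (Exception e) {
            // Parsing or XPath failure, e.g. markup that is not well-formed XML.
            throw new IllegalStateException("Could not evaluate XPath", e);
        }
    }

    public static void main(String[] args) {
        String row = "<tr valign=\"middle\">"
                + "<td class=\"inner\"><a href=\"/VALUE.html\">VALUE</a></td>"
                + "<td width=\"1%\" class=\"small inner\">VALUE</td>"
                + "<td width=\"1%\" class=\"small inner\">VALUE</td>"
                + "<td width=\"1%\" class=\"small inner\">VALUE</td>"
                + "</tr>";
        // The first cell has class "inner" only, so it is not matched.
        for (String value : readValues(row)) {
            System.out.println(value);
        }
    }
}
```

The attribute test compares the whole class string, so a cell whose class attribute is written in a different order or with extra classes would not match. For pages that are not well-formed XML, an HTML-aware parser is the safer choice.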
Answer 1 (score: 1)
You can try jsoup, a great Java package.
Update: with this package, you can solve the problem like this:
String html = "<tr valign=\"middle\">"
+ " <td class=\"inner\">"
+ " <span style=\"\"><span class=\"\" title=\"\"></span> 2 <span class=\"icon ok\" title=\"Verified\"></span> </span><span class=\"icon cat_tv\" title=\"Video » TV\" style=\"bottom:-2;\"></span>"
+ " <a href=\"/VALUE.html\" style=\"line-height:1.4em;\">VALUE</a> "
+ " </td>"
+ " <td width=\"1%\" align=\"center\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>"
+ " <td width=\"1%\" align=\"right\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>"
+ " <td width=\"1%\" align=\"center\" nowrap=\"nowrap\" class=\"small inner\" >VALUE</td>"
+ "</tr>";
Document doc = Jsoup.parse(html, "", Parser.xmlParser());
Elements labelPLine = doc.select("a[href]");
System.out.println("value 1:" + labelPLine.text());
Elements labelPLine2 = doc.select("td[width=1%]");
Iterator<Element> it = labelPLine2.iterator();
int n = 2;
while (it.hasNext()) {
System.out.println("value " + (n++) + ":" + it.next().text());
}
The result will be:
value 1:VALUE
value 2:VALUE
value 3:VALUE
value 4:VALUE