Extracting specific parts of text with jsoup

Time: 2013-05-30 12:14:56

Tags: java jsoup

My web page contains several similar structures, like this one:

<tr>
<td width="10%" bgcolor="#FFFFFF"><font class="bodytext9">1-Jun-2013</font></td>
<td width="4%" bgcolor="#FFFFFF" align=center><font class="bodytext9">Sat</font></td>
<td width="5%" bgcolor="#FFFFFF" align="center"></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9">Another Text</font></td>
<td width="5%" bgcolor="#FFFFFF" align="center"><font class="bodytext9"><img src="img/colors/white.gif"></font></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a class="black_9" href="link2">Here is also Text</a></td>
<td width="15%" bgcolor="#FFFFFF" align="center"><a href="LINKtoWeb" class=list><u>STRING TO CAPTURE</u></a></td>
<td width="4%" bgcolor="#FFFFFF" align="center"><a target="_new" href="AnotherLink"><img src="img/img2.gif" border="0"></a></td>
</tr>

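This structure repeats many times in the page with different text, but I only want to extract the row in which the text "STRING TO CAPTURE" first appears. How can I use Jsoup to extract only that row, the visible text inside it, and the URL

AnotherLink

that appears in the same row as "STRING TO CAPTURE"? I am new to Jsoup, so this is all I have tried: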

Document doc = Jsoup.connect("http://www.website.com").get();

Element link = doc.select("a").first();
String relHref = link.attr("href");
String absHref = link.attr("abs:href");
String text = doc.body().text();
String linkHref = link.attr("href");
String linkText = link.text();

System.out.println("link:" + link);
System.out.println("text:" + text);

But I can't get any further than this. Please give me some advice. Thanks!

1 Answer:

Answer 0: (score: 1)

With this test input:

String test = "<html><body><table>";
test += "<tr>";
test += "<td width=\"10%\" bgcolor=\"#FFFFFF\"><font class=\"bodytext9\">1-Jun-2013</font></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=center><font class=\"bodytext9\">Sat</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\">Another Text</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\"><img src=\"img/colors/white.gif\"></font></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a class=\"black_9\" href=\"link2\">Here is also Text</a></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a href=\"LINKtoWeb\" class=list><u>TEXT THAT DOESN'T MATCH</u></a></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=\"center\"><a target=\"_new\" href=\"NotMatchLink\"><img src=\"img/img2.gif\" border=\"0\"></a></td>";
test += "</tr>";
test += "<tr>";
test += "<td width=\"10%\" bgcolor=\"#FFFFFF\"><font class=\"bodytext9\">1-Jun-2013</font></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=center><font class=\"bodytext9\">Sat</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\">Another Text</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\"><img src=\"img/colors/white.gif\"></font></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a class=\"black_9\" href=\"link2\">Here is also Text</a></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a href=\"LINKtoWeb\" class=list><u>STRING TO CAPTURE</u></a></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=\"center\"><a target=\"_new\" href=\"AnotherLink\"><img src=\"img/img2.gif\" border=\"0\"></a></td>";
test += "</tr>";
test += "<tr>";
test += "<td width=\"10%\" bgcolor=\"#FFFFFF\"><font class=\"bodytext9\">1-Jun-2013</font></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=center><font class=\"bodytext9\">Sat</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\">Another Text</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\"><img src=\"img/colors/white.gif\"></font></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a class=\"black_9\" href=\"link2\">Here is also Text</a></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a href=\"LINKtoWeb\" class=list><u>MORE TEXT THAT DOESN'T MATCH</u></a></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=\"center\"><a target=\"_new\" href=\"NotMatchLink\"><img src=\"img/img2.gif\" border=\"0\"></a></td>";
test += "</tr>";
test += "<tr>";
test += "<td width=\"10%\" bgcolor=\"#FFFFFF\"><font class=\"bodytext9\">1-Jun-2013</font></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=center><font class=\"bodytext9\">Sat</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\">Another Text</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\"><img src=\"img/colors/white.gif\"></font></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a class=\"black_9\" href=\"link2\">Here is also Text</a></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a href=\"LINKtoWeb\" class=list><u>STILL MORE TEXT THAT DOESN'T MATCH</u></a></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=\"center\"><a target=\"_new\" href=\"NotMatchLink\"><img src=\"img/img2.gif\" border=\"0\"></a></td>";
test += "</tr>";
test += "<tr>";
test += "<td width=\"10%\" bgcolor=\"#FFFFFF\"><font class=\"bodytext9\">Second 1-Jun-2013</font></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=center><font class=\"bodytext9\">Second Sat</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\">Second Another Text</font></td>";
test += "<td width=\"5%\" bgcolor=\"#FFFFFF\" align=\"center\"><font class=\"bodytext9\"><img src=\"img/colors/white.gif\"></font></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a class=\"black_9\" href=\"link2\">Second Here is also Text</a></td>";
test += "<td width=\"15%\" bgcolor=\"#FFFFFF\" align=\"center\"><a href=\"LINKtoWeb\" class=list><u>STRING TO CAPTURE</u></a></td>";
test += "<td width=\"4%\" bgcolor=\"#FFFFFF\" align=\"center\"><a target=\"_new\" href=\"SecondAnotherLink\"><img src=\"img/img2.gif\" border=\"0\"></a></td>";
test += "</tr>";
test += "</table></body></html>";

this code:

final Document document = Jsoup.parse(test);
final Element entireRow = document.select("tr:contains(STRING TO CAPTURE)").get(0);
for (final Element column : entireRow.select("td")) {
    System.out.println("Column text is: " + column.text());
}
final Elements link = entireRow.select("td:contains(STRING TO CAPTURE) + td > a[href]");
System.out.println("Target link is: " + link.attr("href"));

outputs:

Column text is: 1-Jun-2013
Column text is: Sat
Column text is: 
Column text is: Another Text
Column text is: 
Column text is: Here is also Text
Column text is: STRING TO CAPTURE
Column text is: 
Target link is: AnotherLink
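
The same selectors can be wrapped in a small helper for use against a live page. The following is only a sketch under a few assumptions: the class `FirstMatchingRow`, the holder type `RowData`, and the method `extract` are hypothetical names introduced for illustration, and the base URI passed in `main` is just the placeholder address from the question. It uses `first()` instead of `get(0)` so that a page without a matching row yields `null` rather than an exception, and `absUrl("href")` to resolve the relative link against the base URI.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

class FirstMatchingRow {

    /** Hypothetical holder: the cell texts of the matched row and the link in the cell after the marker. */
    static final class RowData {
        final List<String> cellTexts = new ArrayList<>();
        String nextCellHref = "";
    }

    /**
     * Returns the data of the first row whose text contains the marker, or null if no row matches.
     * The marker is pasted into the selector as-is, so it must not contain parentheses.
     */
    static RowData extract(String html, String baseUri, String marker) {
        // The base URI is only needed so that relative href values can be resolved.
        Document document = Jsoup.parse(html, baseUri);

        // first() returns null when nothing matches, whereas get(0) would throw.
        Element row = document.select("tr:contains(" + marker + ")").first();
        if (row == null) {
            return null;
        }

        RowData data = new RowData();
        for (Element cell : row.select("td")) {
            data.cellTexts.add(cell.text());
        }

        // The link sits in the cell immediately after the cell that holds the marker.
        Element link = row.select("td:contains(" + marker + ") + td > a[href]").first();
        if (link != null) {
            data.nextCellHref = link.absUrl("href");
        }
        return data;
    }

    public static void main(String[] args) {
        String html = "<table>"
                + "<tr><td>1-Jun-2013</td><td><a href=\"LINKtoWeb\"><u>TEXT THAT DOESN'T MATCH</u></a></td>"
                + "<td><a href=\"NotMatchLink\"><img src=\"img/img2.gif\"></a></td></tr>"
                + "<tr><td>1-Jun-2013</td><td><a href=\"LINKtoWeb\"><u>STRING TO CAPTURE</u></a></td>"
                + "<td><a href=\"AnotherLink\"><img src=\"img/img2.gif\"></a></td></tr>"
                + "</table>";

        RowData data = extract(html, "http://www.website.com/", "STRING TO CAPTURE");
        if (data == null) {
            System.out.println("No matching row found");
        } else {
            System.out.println("Cell texts: " + data.cellTexts);
            System.out.println("Target link: " + data.nextCellHref);
        }
    }
}
```

For a page fetched with `Jsoup.connect(url).get()`, the document already carries its base URI, so the same selectors and `absUrl("href")` can be applied to that `Document` directly.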