jsoup: getting the inner text of elements with selectors

Time: 2015-02-17 20:29:15

Tags: css-selectors html-parsing jsoup

I am looking for help with selector patterns in jsoup. Basically, I am modifying someone else's code to fit my needs.

For example, for href it is done like this:

Elements links = doc.select("a[href]");
for (Element link : links) {
    // get the value from href attribute
    System.out.println("\nlink : " + link.attr("href"));
    System.out.println("text : " + link.text());
}

I am referring to this page, but I am not sure which selector to use: http://jsoup.org/apidocs/org/jsoup/select/Selector.html

I want to find pairs such as "Running Map Tasks, 1" and so on in the following HTML:

<hr>
<h2>Cluster Summary (Heap Size is 555 MB/26.6 GB)</h2>
<table border="1" cellpadding="5" cellspacing="0">
<tr><th>Running Map Tasks</th><th>Running Reduce Tasks</th><th>Total Submissions</th><th>Nodes</th><th>Occupied Map Slots</th><th>Occupied Reduce Slots</th><th>Reserved Map Slots</th><th>Reserved Reduce Slots</th><th>Map Task Capacity</th><th>Reduce Task Capacity</th><th>Avg. Tasks/Node</th><th>Blacklisted Nodes</th><th>Excluded Nodes</th><th>MapTask Prefetch Capacity</th></tr>
<tr><td>1</td><td>0</td><td>5576</td><td><a href="machines.jsp?type=active">8</a></td><td>1</td><td>0</td><td>0</td><td>0</td><td>352</td><td>128</td><td>60.00</td><td><a href="machines.jsp?type=blacklisted">0</a></td><td><a href="machines.jsp?type=excluded">0</a></td><td>0</td></tr></table>
<br>
<hr>

How do I get the text inside all of these tags?

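Should I also look for something like "Cluster Summary", so that I can use the same approach for the rest of my URLs, or handle it some other way? For example: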
<h2 id="running_jobs">Running Jobs</h2>
<table border="1" cellpadding="5" cellspacing="0">
<thead><tr><th><b>Jobid</b></th><th><b>Priority</b></th><th><b>User</b></th><th><b>Name</b></th><th><b>Start Time</b></th><th><b>Map % Complete</b></th><th><b>Current Map Slots</b></th><th><b>Failed MapAttempts</b></th><th><b>MapAttempt Time Avg/Max</b></th><th><b>Cumulative Map CPU</b></th><th><b>Current Map PMem</b></th><th><b>Reduce % Complete</b></th><th><b>Current Reduce Slots</b></th><th><b>FailedReduce Attempts</b></th><th><b>ReduceAttempt Time Avg/Max</b></th><th><b>Cumulative Reduce CPU</b></th><th><b>Current Reduce PMem</b></th></tr>
</thead><tbody><tr><td id="job_0"><a href="jobdetails.jsp?jobid=job_201502130313_1511&refresh=30">job_201502130313_1511</a></td><td id="priority_0">NORMAL</td><td id="user_0">vdeadmin</td><td id="name_0">streamjob1942665573586845283.jar</td><td>Fri Feb 13 17:00:17 PST 2015</td><td>0.00%<table border="1px" width="80px"><tr><td cellspacing="0" class="perc_nonfilled" width="100%"></td></tr></table></td><td><a href="jobtasks.jsp?jobid=job_201502130313_1511&type=map&pagenum=1&state=running">1</a></td><td>0</td><td>0sec/0sec</td><td>1hrs, 30mins, 4sec</td><td>703.48 MB</td><td>0.00%<table border="1px" width="80px"><tr><td cellspacing="0" class="perc_nonfilled" width="100%"></td></tr></table></td><td>0</td><td>0</td><td>0sec/0sec</td><td>0sec</td><td> 0 KB</td></tr>

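Update/addition to the question: my URLs return long HTML pages, and I need to be able to search within a specific group. In other words, the search should work block by block. I do not want to find every tr in the HTML, only the ones that belong to one particular table, and so on. For example, below I am trying to show results only from id="running_jobs", and then do the same for some other sections. Done this way, I should not get results from other parts of the HTML.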

<h2 id="running_jobs">Running Jobs</h2>
<table border="1" cellpadding="5" cellspacing="0">
<thead><tr><th><b>Jobid</b></th><th><b>Priority</b></th><th><b>User</b></th><th><b>Name</b></th><th><b>Start Time</b></th><th><b>Map % Complete</b></th><th><b>Current Map Slots</b></th><th><b>Failed MapAttempts</b></th><th><b>MapAttempt Time Avg/Max</b></th><th><b>Cumulative Map CPU</b></th><th><b>Current Map PMem</b></th><th><b>Reduce % Complete</b></th><th><b>Current Reduce Slots</b></th><th><b>FailedReduce Attempts</b></th><th><b>ReduceAttempt Time Avg/Max</b></th><th><b>Cumulative Reduce CPU</b></th><th><b>Current Reduce PMem</b></th></tr>
</thead><tbody><tr><td id="job_0"><a href="jobdetails.jsp?jobid=job_201502130313_1511&refresh=30">job_201502130313_1511</a></td><td id="priority_0">NORMAL</td><td id="user_0">vdeadmin</td><td id="name_0">streamjob1942665573586845283.jar</td><td>Fri Feb 13 17:00:17 PST 2015</td><td>0.00%<table border="1px" width="80px"><tr><td cellspacing="0" class="perc_nonfilled" width="100%"></td></tr></table></td><td><a href="jobtasks.jsp?jobid=job_201502130313_1511&type=map&pagenum=1&state=running">1</a></td><td>0</td><td>0sec/0sec</td><td>1hrs, 30mins, 4sec</td><td>703.48 MB</td><td>0.00%<table border="1px" width="80px"><tr><td cellspacing="0" class="perc_nonfilled" width="100%"></td></tr></table></td><td>0</td><td>0</td><td>0sec/0sec</td><td>0sec</td><td> 0 KB</td></tr>
</tbody></table>

1 answer:

Answer 0: (score: 0)

What you need to know is CSS selectors and how to use them. In your case, to get the text inside all "tr th" tags, you can use the following code:

Elements trThs = doc.select("tr th");
for(Element trTh : trThs)
    System.out.println("text : " + trTh.text());
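
To restrict the search to one block, as the update to the question asks, one option is to anchor the selector on the heading and take the table that immediately follows it. The sketch below is only an illustration under some assumptions: the class name `TableTextExample` and the two helper methods are made up for this example, the sample HTML is a shortened version of the markup in the question, and it assumes each `h2` is directly followed by its table, as in that markup. It uses jsoup's `:contains(...)` pseudo-selector, the adjacent sibling combinator `+`, and the child combinator `>` so that rows of nested tables are not picked up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TableTextExample {

    // Pairs each header cell with the value cell at the same position,
    // looking only at the table right after the "Cluster Summary" heading.
    static Map<String, String> clusterSummary(Document doc) {
        Map<String, String> result = new LinkedHashMap<>();
        Element table = doc.select("h2:contains(Cluster Summary) + table").first();
        if (table == null) {
            return result; // heading or table not present on this page
        }
        Elements headers = table.select("th");
        Elements values = table.select("td");
        int n = Math.min(headers.size(), values.size());
        for (int i = 0; i < n; i++) {
            result.put(headers.get(i).text(), values.get(i).text());
        }
        return result;
    }

    // Returns the cell texts of each body row of the table that follows
    // <h2 id="running_jobs">. The child combinators keep rows of nested
    // tables (the progress bars) out of the result.
    static List<List<String>> runningJobs(Document doc) {
        List<List<String>> rows = new ArrayList<>();
        for (Element row : doc.select("h2#running_jobs + table > tbody > tr")) {
            List<String> cells = new ArrayList<>();
            for (Element cell : row.children()) {
                cells.add(cell.text());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Shortened sample markup, assumed for illustration only.
        String html =
              "<h2>Cluster Summary (Heap Size is 555 MB/26.6 GB)</h2>"
            + "<table><tr><th>Running Map Tasks</th><th>Nodes</th></tr>"
            + "<tr><td>1</td><td><a href=\"machines.jsp?type=active\">8</a></td></tr></table>"
            + "<h2 id=\"running_jobs\">Running Jobs</h2>"
            + "<table><thead><tr><th><b>Jobid</b></th><th><b>Map % Complete</b></th></tr></thead>"
            + "<tbody><tr><td id=\"job_0\">job_201502130313_1511</td>"
            + "<td>0.00%<table><tr><td class=\"perc_nonfilled\"></td></tr></table></td></tr>"
            + "</tbody></table>";

        Document doc = Jsoup.parse(html);

        for (Map.Entry<String, String> e : clusterSummary(doc).entrySet()) {
            System.out.println(e.getKey() + ", " + e.getValue());
        }
        for (List<String> row : runningJobs(doc)) {
            System.out.println(row);
        }
    }
}
```

If a page wraps the heading and the table in different parents, the `+` combinator will not match and the selector would need adjusting, for example by giving the table its own id and selecting on that.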