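I want to use jsoup to extract all city names and state names from the page List of cities and towns in India. A fragment of the page's HTML is shown below.
Here Abhayapuri is the city name and Assam is the state name. City and state names appear thousands of times on the page in this same table structure; apart from the URLs inside the td tags, everything else is identical.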
<table class="wikitable sortable" style="text-align:;">
<tr>
<th>Name of City/Town</th>
<th>Name of State</th>
<th>Classification</th>
<th>Population (2001)</th>
<th>Population (2011)</th>
</tr>
<tr>
<td><a href="/wiki/Abhayapuri" title="Abhayapuri">Abhayapuri</a></td>
<td><a href="/wiki/Assam" title="Assam">Assam</a></td>
I am new to jsoup. Any help would be appreciated. Thanks.
Answer 0 (score: 2)
Sample code:
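import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.net.URL;

public class CityStateScraper {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page, with a 30-second timeout
        Document root = Jsoup.parse(new URL("http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_India"), 30000);
        // Find all tables
        for (Element table : root.select("table")) {
            Elements th0 = table.select("tbody tr th");
            // Keep only our tables: select() never returns null, so test for emptiness instead
            if (th0.isEmpty() || !th0.get(0).text().trim().equals("Name of City/Town")) {
                continue;
            }
            Elements es = table.select("tbody tr");
            // Start at 1 to skip the header row
            for (int i = 1; i < es.size(); i++) {
                Elements td = es.get(i).select("td");
                if (td.size() < 2) continue; // not a city/state row
                Element cityLink = td.get(0).select("a").first();
                Element stateLink = td.get(1).select("a").first();
                if (cityLink == null || stateLink == null) continue; // a cell without a link
                System.out.println(cityLink.text() + " => " + stateLink.text());
            }
        }
    }
}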
Output:
Abhayapuri => Assam
Achabbal => Jammu and Kashmir
Achalpur => Maharashtra
Achhnera => Uttar Pradesh
Adari => Uttar Pradesh
Adalaj => Gujarat
Adilabad => Andhra Pradesh
Adityana => Gujarat
pereyaapatna => Karnataka
Adoni => Andhra Pradesh
Adoor => Kerala
Adyar => Karnataka
Adra => West Bengal
Afzalpura => Karnataka
Agartala => Tripura
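As a variation on the loop above, the header row can be skipped with a CSS selector instead of starting the index at 1. The following is only a sketch under some assumptions: the class name CityStateSelectorSketch and the method extract are made up for illustration, the HTML in main is a hand-written fragment modeled on the snippet in the question rather than the live page, and it treats every table with class wikitable as a city table (the header check from the answer is left out). It also relies on selectFirst, which is only present in reasonably recent jsoup versions.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.util.ArrayList;
import java.util.List;

public class CityStateSelectorSketch {

    // Hypothetical helper: returns one "city => state" string per data row.
    static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> pairs = new ArrayList<>();
        // tr:has(td) matches only rows that contain <td> cells,
        // so header rows made of <th> cells are skipped.
        for (Element row : doc.select("table.wikitable tr:has(td)")) {
            Elements cells = row.select("td");
            if (cells.size() < 2) {
                continue; // not a city/state row
            }
            Element city = cells.get(0).selectFirst("a");
            Element state = cells.get(1).selectFirst("a");
            if (city != null && state != null) {
                pairs.add(city.text() + " => " + state.text());
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Hand-written fragment modeled on the question's snippet (not fetched from the page).
        String html =
            "<table class=\"wikitable sortable\">"
            + "<tr><th>Name of City/Town</th><th>Name of State</th></tr>"
            + "<tr><td><a href=\"/wiki/Abhayapuri\">Abhayapuri</a></td>"
            + "<td><a href=\"/wiki/Assam\">Assam</a></td></tr>"
            + "</table>";
        extract(html).forEach(System.out::println);
    }
}
```

To run it against the real page, the string passed to extract could be replaced by a fetched document, for example by parsing the URL as the answer does and passing its outer HTML.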