Question

I want to use jsoup to extract all city names and state names from the page List of cities and towns in India. A snippet of the page's HTML is shown below.

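Here Abhayapuri is the city name and Assam is the state name. City and state names like these appear thousands of times on the page, always in this same table structure; apart from the URLs inside the td tags, everything else is identical.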

<table class="wikitable sortable" style="text-align:;">
<tr>
<th>Name of City/Town</th>
<th>Name of State</th>
<th>Classification</th>
<th>Population (2001)</th>
<th>Population (2011)</th>
</tr>
<tr>
<td><a href="/wiki/Abhayapuri" title="Abhayapuri">Abhayapuri</a></td>
<td><a href="/wiki/Assam" title="Assam">Assam</a></td>

I am new to jsoup. Any help would be greatly appreciated. Thanks.

Answer 1

Sample code:

    import java.net.URL;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Fetch and parse the page, with a 30-second timeout
    Document root = Jsoup.parse(new URL("http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_India"), 30000);
    // Find all tables
    Elements tables = root.select("table");
    for (Element table : tables) {
        Elements th0 = table.select("tbody tr th");
        // Keep only our tables: select() never returns null, so check for emptiness instead
        if (!th0.isEmpty() && th0.get(0).text().trim().equals("Name of City/Town")) {
            Elements es = table.select("tbody tr");
            // Start at 1 to skip the header row
            for (int i = 1; i < es.size(); i++) {
                Elements td = es.get(i).select("td");
                if (td.size() < 2) continue; // skip rows without both cells
                String city = td.get(0).select("a").first().text();
                String state = td.get(1).select("a").first().text();
                System.out.println(city + " => " + state);
            }
        }
    }

Output:

Abhayapuri => Assam
Achabbal => Jammu and Kashmir
Achalpur => Maharashtra
Achhnera => Uttar Pradesh
Adari => Uttar Pradesh
Adalaj => Gujarat
Adilabad => Andhra Pradesh
Adityana => Gujarat
pereyaapatna => Karnataka
Adoni => Andhra Pradesh
Adoor => Kerala
Adyar => Karnataka
Adra => West Bengal
Afzalpura => Karnataka
Agartala => Tripura
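
If the pairs are needed for further processing rather than just printed, one option is to collect them into a map keyed by state. The following is only a minimal sketch and not part of the original answer: the class `CityStateGrouping` and the method `groupByState` are hypothetical names, and the code assumes its input lines use the same `city => state` format as the output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of the original answer.
class CityStateGrouping {

    // Groups lines of the form "city => state" into a map from state to its cities.
    static Map<String, List<String>> groupByState(List<String> lines) {
        // TreeMap keeps the states in alphabetical order
        Map<String, List<String>> byState = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(" => ", 2);
            if (parts.length < 2) {
                continue; // skip lines that do not match the assumed format
            }
            byState.computeIfAbsent(parts[1].trim(), k -> new ArrayList<>())
                   .add(parts[0].trim());
        }
        return byState;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "Abhayapuri => Assam",
                "Adilabad => Andhra Pradesh",
                "Adoni => Andhra Pradesh");
        // prints {Andhra Pradesh=[Adilabad, Adoni], Assam=[Abhayapuri]}
        System.out.println(groupByState(sample));
    }
}
```

In the scraping loop itself, the same `computeIfAbsent` call could replace the `System.out.println` line, which would avoid formatting and re-parsing strings.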
