Correctly extracting information from a table using JSoup on Android

Asked: 2016-09-14 22:56:05

Tags: android web-scraping jsoup

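I'm trying to extract some information from an HTML table and put it into arraylist = new ArrayList<HashMap<String, String>>(); so that I can manage it more easily inside my app.

After a POST request, I've been able to store the correct HTML page in the document variable. Below is the HTML that contains the data I need, but it is not the only table on the page. I don't know how to find the items in this specific table.

What is the right way to get the data in this format: DAY - TIME - SUGGESTION

Thanks in advance for any suggestions!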

<table><tbody>
<tr><th class="date">Wed, 14 Sep 2016</th><th></th><th></th></tr>
<tr><td>&nbsp;</td><td class="sub">09:00</td><td class="sugg">Depart and set your watch to the arrival city&#39;s time zone (03:00). Sleep as needed. The following times are in the arrival city&#39;s time zone.</td></tr>
<tr><td>&nbsp;</td><td class="sub">18:30</td><td class="sugg">Arrive</td></tr>
<tr><td>&nbsp;</td><td class="sub">19:00&ndash;22:00</td><td class="sugg">Seek light</td></tr>
<tr><td>&nbsp;</td><td class="sub">22:00&ndash;23:00</td><td class="sugg">Avoid light before bed</td></tr>
<tr><td>&nbsp;</td><td class="sub">23:00&ndash;07:00</td><td class="sugg">Sleep ideal</td></tr>
<tr><th class="date">Thu, 15 Sep 2016</th><th></th><th></th></tr>
<tr><td>&nbsp;</td><td class="sub">20:00&ndash;23:00</td><td class="sugg">Seek light before bed</td></tr>
<tr><td>&nbsp;</td><td class="sub">23:00&ndash;07:00</td><td class="sugg">Sleep ideal</td></tr>
<tr><th class="date">Fri, 16 Sep 2016</th><th></th><th></th></tr>
<tr><td>&nbsp;</td><td class="sub">20:00&ndash;23:00</td><td class="sugg">Seek light before bed</td></tr>
<tr><td>&nbsp;</td><td class="sub">23:00&ndash;07:00</td><td class="sugg">Sleep ideal</td></tr>
</tbody></table>

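Edit

I think a loop is the way to achieve what I want, and I'm getting closer to a solution. I need a way to detect whether the row I'm currently checking in the loop has th or td cells: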

//find the table, it is the second table in the HTML
Element table = document.select("tbody").get(1);

//get all the rows
Elements rows = table.select("tr");

//loop the rows
for (Element row : rows) {

    //if the row contains th, I get the first cell and save day in a string

    //if the row contains td, I get the second (time) and third (suggestion) cells and put in my map string with day, time, suggestion

}

2 answers:

Answer 0 (score: 1)

You have two options. You can use CSS selectors to pull out all the elements by class.

https://try.jsoup.org/
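
As a rough illustration of the class-based option, the sketch below walks every tr on the page and looks up cells by the classes shown in the question's HTML (date, sub, sugg). It assumes those class names are used only in the target table, so rows from other tables are skipped without needing a table index. The class name ScheduleParser and the method parse are hypothetical, introduced only for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ScheduleParser {

    // Hypothetical helper: returns one map per data row, carrying the last seen date forward.
    public static List<HashMap<String, String>> parse(Document document) {
        List<HashMap<String, String>> result = new ArrayList<>();
        String day = null;

        for (Element row : document.select("tr")) {
            // A header row carries the date in a th with class "date"
            Element date = row.select("th.date").first();
            if (date != null) {
                day = date.text();
                continue;
            }

            // A data row carries the time (class "sub") and the suggestion (class "sugg")
            Element time = row.select("td.sub").first();
            Element sugg = row.select("td.sugg").first();
            if (time == null || sugg == null) {
                continue; // assumption: a row without both cells belongs to another table
            }

            HashMap<String, String> map = new HashMap<>();
            map.put("day", day);
            map.put("time", time.text());
            map.put("sugg", sugg.text());
            result.add(map);
        }
        return result;
    }
}
```

Selecting by class rather than by position means the code should keep working if another table is added earlier on the page, as long as the class names stay the same.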

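Or you can iterate over the elements.

Document doc = Jsoup.connect(url).get();
// first() returns the first tbody on the page; use get(index) to pick another one
Element tbody = doc.select("tbody").first();
for (Element row : tbody.children()) {
    // do stuff here
}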

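Answer 1 (score: 0)

Well, I came up with a solution. Maybe not the best coding style, but it works :) (engineer: "if it works, it's fine")

I have some coding experience in a few languages, but this is my first time dealing with parsing, and therefore with JSoup. It isn't a tool that is immediately easy to understand, but during my research I noticed that it is very powerful. I've put it on my personal list of things to study.

Note: this approach assumes that a th row always comes before the td rows.

Here is my solution: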

String day = null;

// Narrow the page down to the table I need. It has no specific id,
// so I select it as the second tbody on the page.
Element table = document.select("tbody").get(1);

// All the rows in the table
Elements rows = table.select("tr");

for (Element row : rows) {

    // If the row contains th elements, store the first th as the current day
    Elements headers = row.select("th");
    if (!headers.isEmpty()) {
        day = headers.get(0).text();
    }

    // If the row contains td elements, the second and third td hold the time and the suggestion
    Elements cells = row.select("td");
    if (cells.size() >= 3) {
        String time = cells.get(1).text();
        String sugg = cells.get(2).text();

        Log.d("row: ", day + " " + time + " " + sugg);

        // Build and add the map only for data rows, so th rows don't add empty maps to the list
        HashMap<String, String> map = new HashMap<String, String>();
        map.put("day", day);
        map.put("time", time);
        map.put("sugg", sugg);
        arraylist.add(map);
    }
}
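
To get from the list of maps to the DAY - TIME - SUGGESTION format asked for in the question, a small formatting step is enough. The sketch below is one possible way to do it; ScheduleFormatter and format are hypothetical names, and the keys "day", "time" and "sugg" are the ones used in the solution above. Maps without a "time" entry are skipped, on the assumption that they don't represent a data row.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public class ScheduleFormatter {

    // Hypothetical helper: turns each map into a "DAY - TIME - SUGGESTION" line.
    public static List<String> format(List<HashMap<String, String>> rows) {
        List<String> lines = new ArrayList<>();
        for (HashMap<String, String> row : rows) {
            // Skip maps that carry no time, since they hold no schedule entry
            if (!row.containsKey("time")) {
                continue;
            }
            lines.add(row.get("day") + " - " + row.get("time") + " - " + row.get("sugg"));
        }
        return lines;
    }
}
```

For the second data row of the sample table, this would produce the line "Wed, 14 Sep 2016 - 18:30 - Arrive".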