Java: reading information from a website using Jsoup

Time: 2014-02-06 09:48:23

Tags: java parsing jsoup

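I have read a lot of posts about parsing. Most of the replies I've seen recommend using a library of some kind. My problem right now is coming up with an algorithm that fetches exactly the information I want. My goal is to get the school-closing statuses of 2 schools from a weather website. I started using Jsoup, as people recommended, but I need help.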

Web page: Click here

Image: Click here

Sample of the page source: click here

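I'd like to know how to get a specific line of text from the web page. I already know the names of the schools I'm looking for, but what I need is the status on the line that follows each name. If every school had one fixed status I could just search for that text, but a status can be Closed, a Two-hour Delay, and so on, so I can't simply search for it. I'd like ideas or answers on how to approach this. I'm going to do this twice, because I want to look up 2 schools. I already have the names to find them by; I just need the statuses.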

Here is an example of what I want to do (pseudocode):

Document doc = connect(to url);
Element schoolName1 = doc.lookForText(htmlLineHere/schoolname);

String status1 = schoolName1.getNext().text(); // suppose this gets the line right after the name, which should be my status, and strips the HTML

This is what I have so far:

public static SchoolClosing lookupDebug() throws IOException {
        final ArrayList<String> Status = new ArrayList<String>();

        try {
            //connects to my wanted website
            Document doc = Jsoup.connect("http://www.10tv.com/content/sections/weather/closings.html").get();
            //selects/fetches the line of code I want
            Element schoolName = doc.html("<td valign="+"top"+">Athens City Schools</td>");
            //an array of Strings where I am going to add the text I need when I get it
            final ArrayList<String> temp = new ArrayList<String>();
            //checking if its fetching the text
            System.out.println(schoolName.text());
            //add the text to the array
            temp.add(schoolName.text());
            for (int i = 0; i <= 1; i++) {
                final String[] tempStatus = temp.get(i).split(" ");
                Status.add(tempStatus[0]);
            }
        } catch (final IOException e) {
            throw new IOException("There was a problem loading School Closing Status");
        }
        return new SchoolClosing(Status);
    }

1 answer:

Answer 0 (score: 2)

Document doc = Jsoup.connect(
        "http://www.10tv.com/content/sections/weather/closings.html")
        .get();
for (Element tr : doc.select("#closings tr")) {
    Element tds = tr.select("td").first();
    if (tds != null) {
        String county = tr.select("td:eq(0)").text();
        String schoolName = tr.select("td:eq(1)").text();
        String status = tr.select("td:eq(2)").text();
        System.out.println(String.format(
                "county: %s, schoolName: %s, status: %s", county,
                schoolName, status));
    }
}

Output:

county: Athens, schoolName: Beacon School, status: Two-hour Delay
county: Franklin, schoolName: City of Grandview Heights, status: Snow Emergency through 8pm Thursday
county: Franklin, schoolName: Electrical Trades Center, status: All Evening Activities Cancelled
county: Franklin, schoolName: Hilock Fellowship Church, status: PM Services Cancelled
county: Franklin, schoolName: International Christian Center, status: All Evening Activities Cancelled
county: Franklin, schoolName: Maranatha Baptist Church, status: PM Services Cancelled
county: Franklin, schoolName: Masters Commission New Covenant Church, status: Bible Study Cancelled
county: Franklin, schoolName: New Life Christian Fellowship, status: All Activities Cancelled
county: Franklin, schoolName: The Epilepsy Foundation of Central Ohio, status: All Evening Activities Cancelled
county: Franklin, schoolName: Washington Ave United Methodist Church, status: All Evening Activities Cancelled

Or, looping over every cell:

for (Element tr : doc.select("#closings tr")) {
    System.out.println("----------------------");
    for (Element td : tr.select("td")) {
        System.out.println(td.text());
    }
}

which gives:

----------------------
Athens
Beacon School
Two-hour Delay
----------------------
Franklin
City of Grandview Heights
Snow Emergency through 8pm Thursday
----------------------
Franklin
Electrical Trades Center
All Evening Activities Cancelled
...
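
Both loops print every row, whereas the question only needs the status of two known schools. Once each row's three cell texts have been read, the lookup can be a plain map keyed by school name. The following is a minimal sketch under some assumptions: the class name `ClosingLookup`, the method `statusesFor`, and the fallback text "No closing reported" are hypothetical names chosen for illustration, and each row is assumed to arrive as a `{county, schoolName, status}` array filled from `td:eq(0)`, `td:eq(1)` and `td:eq(2)`. It deliberately uses only the standard library, so the Jsoup part stays exactly as in the answer.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class ClosingLookup {

    // Hypothetical fallback for schools that do not appear on the page.
    static final String NOT_LISTED = "No closing reported";

    /**
     * Maps each wanted school name to its status, in the order requested.
     * Each row is assumed to be {county, schoolName, status}; shorter rows
     * (for example a header row without td cells) are skipped.
     */
    static Map<String, String> statusesFor(List<String[]> rows, List<String> wanted) {
        // Index the rows by lower-cased school name for case-insensitive lookup.
        Map<String, String> byName = new LinkedHashMap<>();
        for (String[] row : rows) {
            if (row.length >= 3) {
                byName.put(row[1].trim().toLowerCase(Locale.ROOT), row[2].trim());
            }
        }
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : wanted) {
            String key = name.trim().toLowerCase(Locale.ROOT);
            result.put(name, byName.getOrDefault(key, NOT_LISTED));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample rows copied from the output above; in real use they would be
        // filled inside the loop over doc.select("#closings tr").
        List<String[]> rows = Arrays.asList(
                new String[] {"Athens", "Beacon School", "Two-hour Delay"},
                new String[] {"Franklin", "Electrical Trades Center", "All Evening Activities Cancelled"});
        Map<String, String> statuses =
                statusesFor(rows, Arrays.asList("Beacon School", "Athens City Schools"));
        for (Map.Entry<String, String> entry : statuses.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }
    }
}
```

Keeping the whole status string, rather than only its first word as the question's `split(" ")` does, avoids turning "Two-hour Delay" into "Two-hour". A school that is absent from the table most likely has no closing to report, which is why the sketch returns a default instead of throwing.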