Parsing an HTML web page

Date: 2013-04-13 22:31:12

Tags: java html-parsing jsoup

I am using JSoup to parse data from this website:

http://www.skore.com/en/soccer/england/premier-league/results/all/

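I can get the team names and the results, but I also need the names of the goal scorers for each result.

I have been trying, but I am stuck because that information is not in the page's HTML.

Is it possible? If so, how?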

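1 Answer:

Answer 0: (score: 3)

The scorer information is loaded by an AJAX request, which is sent when you click the score link. You have to make that request yourself and parse the result.

For instance, take the first game (Manchester United 1x2 Manchester City). Its tag is: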

<a data-y="r1-1229442" data-v="england-premierleague-manchesterunited-manchestercity-13april2013" style="cursor: pointer;">1 - 2</a>

Take the data-y value, remove the leading r, and send a GET request to the following URL:

http://www.skore.com/en/scores/soccer/id/<DATA-Y_HERE>?fmt=html

For example: http://www.skore.com/en/scores/soccer/id/1-1229442?fmt=html. Then parse the result.
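The URL construction can be sketched in plain Java, without any network access. The class and method names (`ScoreUrl`, `fromDataY`) are hypothetical, and the URL pattern is simply the one quoted in this answer, so it may no longer match what the site serves.

```java
/** Sketch: build the per-match detail URL from a data-y attribute value. */
final class ScoreUrl {

    // URL pattern taken from this answer; assumed, not verified against the live site.
    private static final String BASE = "http://www.skore.com/en/scores/soccer/id/";

    /** Strips the leading "r" from a data-y value such as "r1-1229442" and builds the URL. */
    static String fromDataY(String dataY) {
        if (dataY == null || dataY.length() < 2 || !dataY.startsWith("r")) {
            throw new IllegalArgumentException("unexpected data-y value: " + dataY);
        }
        return BASE + dataY.substring(1) + "?fmt=html";
    }

    public static void main(String[] args) {
        // prints http://www.skore.com/en/scores/soccer/id/1-1229442?fmt=html
        System.out.println(fromDataY("r1-1229442"));
    }
}
```

Rejecting values that do not start with "r" keeps a malformed attribute from silently producing a wrong URL.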

Full working example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseScore {

    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://www.skore.com/en/soccer/england/premier-league/results/all/").get();
        System.out.println("title: " + doc.title());

        Elements dls = doc.select("dl");

        for (Element link : dls) {
            String id = link.attr("id");

            /* only game <dl> elements have an id of the form "rid<match id>" */
            if (id.length() > 3 && id.startsWith("rid")) {

                System.out.println("Game: " + link.text());

                String idNoRID = id.substring(3); // strip the leading "rid"
                // String idNoRID = "1-1229442";
                String scoreURL = "http://www.skore.com/en/scores/soccer/id/" + idNoRID + "?fmt=html";
                Document docScore = Jsoup.connect(scoreURL).get();

                Elements trs = docScore.select("tr");
                for (Element tr : trs) {
                    Elements spanGoal = tr.select("span.goal");
                    /* only enter if there is a goal */
                    if (spanGoal.size() > 0) {
                        Elements score = tr.select("td.score");
                        String playerName = spanGoal.get(0).text();
                        String currentScore = score.get(0).text();
                        System.out.println("\t\tGOAL: " + currentScore + ": " + playerName);
                    }

                    Elements spanGoalPenalty = tr.select("span.goalpenalty");
                    /* only enter if there is a penalty goal */
                    if (spanGoalPenalty.size() > 0) {
                        Elements score = tr.select("td.score");
                        String playerName = spanGoalPenalty.get(0).text();
                        String currentScore = score.get(0).text();
                        System.out.println("\t\tGOAL: " + currentScore + ": " + playerName + " (penalty)");
                    }

                    Elements spanGoalOwn = tr.select("span.goalown");
                    /* only enter if there is an own goal */
                    if (spanGoalOwn.size() > 0) {
                        Elements score = tr.select("td.score");
                        String playerName = spanGoalOwn.get(0).text();
                        String currentScore = score.get(0).text();
                        System.out.println("\t\tGOAL: " + currentScore + ": " + playerName + " (own goal)");
                    }
                }
            }
        }
    }
}

Output:

title: Skore : Premier League, England - Soccer Results (All)
Game: F T Arsenal 3 - 1 Norwich
        GOAL: 0 - 1: Michael Turner
        GOAL: 1 - 1: Mikel Arteta (penalty)
        GOAL: 2 - 1: Sébastien Bassong (own goal)
        GOAL: 3 - 1: Lukas Podolski
Game: F T Aston Villa 1 - 1 Fulham
        GOAL: 1 - 0: Charles N´Zogbia
        GOAL: 1 - 1: Fabian Delph (own goal)
Game: F T Everton 2 - 0 Queens Park Rangers
        GOAL: 1 - 0: Darron Gibson
        GOAL: 2 - 0: Victor Anichebe
Game: F T Reading 0 - 0 Liverpool
Game: F T Southampton 1 - 1 West Ham
        GOAL: 1 - 0: Gaston Ramirez
        GOAL: 1 - 1: Andrew Carroll
Game: F T Manchester United 1 - 2 Manchester City
        GOAL: 0 - 1: James Milner
...
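The three goal branches in the example differ only in the CSS class they select and the suffix they print, so they could be driven from a small table. The following is a minimal sketch of that idea using only the standard library. `GoalLabels` and `format` are hypothetical names, not part of the original answer, and the class names are the ones selected in the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: table-driven formatting of goal lines (hypothetical helper). */
final class GoalLabels {

    // CSS class of the <span> -> suffix printed after the player name.
    static final Map<String, String> SUFFIX_BY_CLASS = new LinkedHashMap<>();
    static {
        SUFFIX_BY_CLASS.put("goal", "");
        SUFFIX_BY_CLASS.put("goalpenalty", " (penalty)");
        SUFFIX_BY_CLASS.put("goalown", " (own goal)");
    }

    /** Returns the output line for one goal, or null if cssClass is not a known goal class. */
    static String format(String cssClass, String score, String player) {
        String suffix = SUFFIX_BY_CLASS.get(cssClass);
        if (suffix == null) {
            return null;
        }
        return "\t\tGOAL: " + score + ": " + player + suffix;
    }

    public static void main(String[] args) {
        System.out.println(format("goalpenalty", "1 - 1", "Mikel Arteta"));
    }
}
```

Inside the row loop, one would iterate over `SUFFIX_BY_CLASS.keySet()`, call `tr.select("span." + cssClass)` for each key, and pass the first match's text to `format`.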

JSoup 1.7.1 was used. If you use Maven, add this to your pom.xml:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.1</version>
</dependency>