Parsing a multi-page table with jsoup

Time: 2015-11-24 14:56:48

Tags: java selenium jsoup

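I am trying to use jsoup and Selenium in Java to parse and extract data from the player statistics table on this website.

However, I run into problems when the table spans multiple pages. Any suggestions on how to parse all pages of the table?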

1 answer:

Answer 0: (score: 0)

You cannot extract the data from the table itself, because the website fills it in by loading the data from this URL:

http://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics?category=summary&subcategory=all&statsAccumulationType=0&isCurrent=true&playerId=&teamIds=&matchId=&stageId=&tournamentOptions=2,3,4,5,22&sortBy=Rating&sortAscending=&age=&ageComparisonType=&appearances=&appearancesComparisonType=&field=Overall&nationality=&positionOptions=&timeOfTheGameEnd=&timeOfTheGameStart=&isMinApp=true&page=&includeZeroValues=&numberOfPlayersToPick=10

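It returns JSON that you can parse.

You can still use Jsoup to read the content of this URL, although it is not the ideal tool for the job: Jsoup is an HTML parser, and here it only acts as an HTTP client.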

package com.github.davidepastore.stackoverflow33896871;

import java.io.IOException;

import org.json.JSONObject;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

/**
 * Stackoverflow 33896871
 *
 */
public class App {

    private static final String WHO_SCORED_URL = "http://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics?category=summary&subcategory=all&statsAccumulationType=0&isCurrent=true&playerId=&teamIds=&matchId=&stageId=&tournamentOptions=2,3,4,5,22&sortBy=Rating&sortAscending=&age=&ageComparisonType=&appearances=&appearancesComparisonType=&field=Overall&nationality=&positionOptions=&timeOfTheGameEnd=&timeOfTheGameStart=&isMinApp=true&page=%d&includeZeroValues=&numberOfPlayersToPick=10";

    public static void main(String[] args) throws IOException, InterruptedException {
        //The first page carries both data and the paging metadata, so fetch it once
        JSONObject jsonObject = executeRequest(1);
        handleData(jsonObject);

        //Count the total number of pages
        int totalPages = jsonObject.getJSONObject("paging").getInt("totalPages");

        //Page 1 is already handled; use <= so the last page is not skipped
        for(int i = 2; i <= totalPages; i++){
            //It's better to sleep for some seconds to avoid 403 errors
            Thread.sleep(5000);

            jsonObject = executeRequest(i);
            handleData(jsonObject);
        }
    }

    /**
     * Execute the request to the server.
     * @param page The page number.
     * @return Returns the {@link JSONObject} from the body.
     * @throws IOException If the request fails or the server answers with an HTTP error status.
     */
    private static JSONObject executeRequest(Integer page) throws IOException{
        String url = String.format(WHO_SCORED_URL, page);
        Response response = Jsoup.connect(url)
                .timeout(20 * 1000) //20 seconds timeout, in milliseconds
                .header("Accept-Encoding", "gzip, deflate, sdch")
                .header("Host", "www.whoscored.com")
                .header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36")
                .header("Referer", "http://www.whoscored.com/Statistics")
                .header("X-Requested-With", "XMLHttpRequest")
                .ignoreContentType(true)
                .execute();
        String body = response.body();
        JSONObject jsonObject = new JSONObject(body);
        return jsonObject;
    }

    /**
     * Handle data from the service.
     * @param jsonObject The {@link JSONObject} received from the service.
     */
    private static void handleData(JSONObject jsonObject){
        //My amazing business logic
        System.out.println(jsonObject);
    }
}

In this example I also use org.json (json 20151123) to parse the JSON response.
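
The loop in App pauses five seconds between requests to avoid 403 errors. If a 403 still slips through, one option is to retry the same page with a growing delay. Jsoup reports HTTP error statuses as `HttpStatusException`, which is a subclass of `IOException`, so catching `IOException` covers that case. The following is only a minimal sketch under those assumptions: the `Retry` class, `withBackoff`, and its parameters are hypothetical names, not part of Jsoup or of the code above.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

final class Retry {

    /**
     * Runs the call and retries it when it throws an IOException,
     * doubling the delay before each further attempt.
     * Hypothetical helper: not part of Jsoup or of the original example.
     */
    public static <T> T withBackoff(Callable<T> call, int maxAttempts, long initialDelayMs)
            throws Exception {
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up after the last allowed attempt
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // wait longer before the next attempt
            }
        }
    }
}
```

Inside `App`, a call could then look like `Retry.withBackoff(() -> executeRequest(3), 4, 5000)`, with `main` declared as `throws Exception`. The attempt count and the starting delay are guesses; what the site actually tolerates is not known here.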

The pom.xml of the example project:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.github.davidepastore</groupId>
    <artifactId>stackoverflow33896871</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>stackoverflow33896871</name>
    <url>http://maven.apache.org</url>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.8.3</version>
        </dependency>
        <dependency>
            <groupId>org.json</groupId>
            <artifactId>json</artifactId>
            <version>20151123</version>
        </dependency>

        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>
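
Since the feed returns JSON rather than HTML, the request can also be sent without Jsoup. Below is a rough sketch using the `java.net.http` client, assuming Java 11 or newer. `FeedClient`, `buildRequest`, and `fetchPage` are hypothetical names, and the URL template (with a single `%d` for the page number) is passed in by the caller. Only a few of the headers from the Jsoup example are set, because `HttpRequest.Builder` rejects some header names such as `Host`; whether the server still answers without the remaining headers has not been verified.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

final class FeedClient {

    /**
     * Builds the GET request for one page of the feed.
     * urlTemplate must contain exactly one %d placeholder for the page number.
     */
    public static HttpRequest buildRequest(String urlTemplate, int page) {
        return HttpRequest.newBuilder(URI.create(String.format(urlTemplate, page)))
                .timeout(Duration.ofSeconds(20))
                .header("User-Agent", "Mozilla/5.0")
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    /** Sends the request and returns the raw JSON body as a string. */
    public static String fetchPage(HttpClient client, String urlTemplate, int page)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(urlTemplate, page), HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }
}
```

The returned string can be passed to `new JSONObject(body)` exactly as in `executeRequest`, so the paging loop stays the same.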