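I am trying to parse and extract data from the player statistics table on this website, using jsoup and Selenium in Java.
However, I run into problems when the table spans multiple pages. Any suggestions on how to parse all pages of the table?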
Answer 0 (score: 0)
You can't extract the data from the table's HTML, because the site loads it separately from this URL:
http://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics?category=summary&subcategory=all&statsAccumulationType=0&isCurrent=true&playerId=&teamIds=&matchId=&stageId=&tournamentOptions=2,3,4,5,22&sortBy=Rating&sortAscending=&age=&ageComparisonType=&appearances=&appearancesComparisonType=&field=Overall&nationality=&positionOptions=&timeOfTheGameEnd=&timeOfTheGameStart=&isMinApp=true&page=&includeZeroValues=&numberOfPlayersToPick=10
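It returns JSON that you can parse directly.
You can still use Jsoup to read the content of this URL, although Jsoup is an HTML parser and not the best fit for this task.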
package com.github.davidepastore.stackoverflow33896871;

import java.io.IOException;

import org.json.JSONObject;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;

/**
 * Stackoverflow 33896871
 */
public class App {

    private static final String WHO_SCORED_URL = "http://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics?category=summary&subcategory=all&statsAccumulationType=0&isCurrent=true&playerId=&teamIds=&matchId=&stageId=&tournamentOptions=2,3,4,5,22&sortBy=Rating&sortAscending=&age=&ageComparisonType=&appearances=&appearancesComparisonType=&field=Overall&nationality=&positionOptions=&timeOfTheGameEnd=&timeOfTheGameStart=&isMinApp=true&page=%d&includeZeroValues=&numberOfPlayersToPick=10";

    public static void main(String[] args) throws IOException, InterruptedException {
        // The first page also reports the total number of pages
        JSONObject jsonObject = executeRequest(1);
        int totalPages = jsonObject.getJSONObject("paging").getInt("totalPages");
        handleData(jsonObject);

        // Page numbers start at 1, so the last page is totalPages itself
        for (int page = 2; page <= totalPages; page++) {
            // It's better to sleep for some seconds to avoid 403 errors
            Thread.sleep(5000);
            handleData(executeRequest(page));
        }
    }

    /**
     * Execute the request to the server.
     * @param page The page number.
     * @return Returns the {@link JSONObject} from the body.
     * @throws IOException If the request fails.
     */
    private static JSONObject executeRequest(int page) throws IOException {
        String url = String.format(WHO_SCORED_URL, page);
        Response response = Jsoup.connect(url)
                .timeout(20 * 1000) // 20 seconds timeout
                .header("Accept-Encoding", "gzip, deflate, sdch")
                .header("Host", "www.whoscored.com")
                .header("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36")
                .header("Referer", "http://www.whoscored.com/Statistics")
                .header("X-Requested-With", "XMLHttpRequest")
                .ignoreContentType(true) // the body is JSON, not HTML
                .execute();
        return new JSONObject(response.body());
    }

    /**
     * Handle data from the service.
     * @param jsonObject The {@link JSONObject} received from the service.
     */
    private static void handleData(JSONObject jsonObject) {
        // My amazing business logic
        System.out.println(jsonObject);
    }
}
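The paging logic can also be separated from the HTTP call, which makes it easy to check that every page is visited exactly once. The following is only a minimal sketch: `PageWalker`, `fetchAll` and the stand-in fetcher are hypothetical names that are not part of the answer, and it assumes that page numbers start at 1 and that the first response reports the total page count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;
import java.util.function.ToIntFunction;

final class PageWalker {

    /**
     * Fetches page 1, reads the page count from it, then fetches pages
     * 2..totalPages, pausing delayMillis before each further request.
     * Returns all pages in order.
     */
    static <T> List<T> fetchAll(IntFunction<T> fetchPage,
                                ToIntFunction<T> totalPages,
                                long delayMillis) {
        List<T> pages = new ArrayList<>();
        T first = fetchPage.apply(1);
        pages.add(first);
        int total = totalPages.applyAsInt(first);
        for (int page = 2; page <= total; page++) {
            try {
                Thread.sleep(delayMillis); // be polite to the server
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while paging", e);
            }
            pages.add(fetchPage.apply(page));
        }
        return pages;
    }

    public static void main(String[] args) {
        // Stand-in fetcher (hypothetical): pretends the server has 4 pages.
        List<String> pages = fetchAll(page -> "page-" + page, first -> 4, 0L);
        System.out.println(pages); // prints [page-1, page-2, page-3, page-4]
    }
}
```

To plug in a real fetcher that throws `IOException`, one option is to wrap the exception in `UncheckedIOException` inside the lambda, since `IntFunction` cannot declare checked exceptions.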
In this example I also use org.json (artifact json, version 20151123) to parse the JSON response.
pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.github.davidepastore</groupId>
    <artifactId>stackoverflow33896871</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>
    <name>stackoverflow33896871</name>
    <url>http://maven.apache.org</url>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.8.3</version>
        </dependency>
        <dependency>
            <groupId>org.json</groupId>
            <artifactId>json</artifactId>
            <version>20151123</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
    </dependencies>
</project>
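Because the feed is plain JSON over HTTP, the request can also be built with the JDK's own `java.net.http` client (Java 11 or later) instead of Jsoup. This is a rough sketch rather than a tested replacement: `FeedRequest` and `forPage` are hypothetical names, the URL template in `main` is shortened for illustration, and whether the server accepts requests without the full query string and headers used in the answer is an assumption.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;

final class FeedRequest {

    /**
     * Builds a GET request for one page of the feed.
     * urlTemplate must contain a single %d placeholder for the page number.
     */
    static HttpRequest forPage(String urlTemplate, int page) {
        return HttpRequest.newBuilder(URI.create(String.format(urlTemplate, page)))
                .timeout(Duration.ofSeconds(20))
                .header("User-Agent", "Mozilla/5.0")
                .header("Referer", "http://www.whoscored.com/Statistics")
                .header("X-Requested-With", "XMLHttpRequest")
                .GET()
                .build();
    }

    public static void main(String[] args) {
        // Shortened template, for illustration only.
        String template =
                "http://www.whoscored.com/StatisticsFeed/1/GetPlayerStatistics?category=summary&page=%d";
        HttpRequest request = forPage(template, 3);
        // Only prints the target URI; nothing is sent over the network here.
        System.out.println(request.uri());
    }
}
```

The request would be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`. The `Host` header is omitted because this client sets it itself and rejects attempts to override it, and `Accept-Encoding` is left out because the client does not decompress gzip responses automatically.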