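I'm trying to get data from a web page (http://steamcommunity.com/id/Winning117/games/?tab=all) using a specific tag, but I keep getting null. What I want is the number of hours played for a specific game, in this case Cluckles' Adventure. Any help is appreciated, thanks :)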
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TestScrape {
    public static void main(String[] args) throws Exception {
        String url = "http://steamcommunity.com/id/Winning117/games/?tab=all";
        Document document = Jsoup.connect(url).get();
        Element playTime = document.select("div#game_605250").first();
        System.out.println(playTime);
    }
}
Edit: How can I tell whether a web page uses JavaScript and therefore cannot be parsed with Jsoup?
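Answer 0 (score: 1)
The page you want to scrape is populated by JavaScript, so the HTML that Jsoup fetches does not contain any #game_605250 element. All of the data is written into the page by JavaScript.
However, when I printed the document to a file, I saw data like this: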
<script language="javascript">
var rgGames = [{"appid":224260,"name":"No More Room in Hell","logo":"http:\/\/cdn.steamstatic.com.8686c.com\/steamcommunity\/public\/images\/apps\/224260\/670e9aba35dc53a6eb2bc686d302d357a4939489.jpg","friendlyURL":224260,"availStatLinks":{"achievements":true,"global_achievements":true,"stats":false,"leaderboards":false,"global_leaderboards":false},"hours_forever":"515","last_played":1492042097},{"appid":241540,"name":"State of Decay","logo":"http:\/\/....
You can then extract 'rgGames' with some string tools and parse it as a JSON object.
It is not an elegant approach, but it works.
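The string-extraction idea above can be sketched roughly as follows. This is only an illustration using the Java standard library: the class name RgGamesExtractor and both helper methods are hypothetical, the regular expression assumes the array ends at the first "];", and the lookup assumes every game object starts with "appid" and that the game name appears unescaped in the JSON. The sample HTML in main is a shortened version of the snippet shown above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup or any Steam API.
final class RgGamesExtractor {

    // Captures the array literal assigned to rgGames (assumes it ends at the first "];").
    private static final Pattern RG_GAMES =
            Pattern.compile("var\\s+rgGames\\s*=\\s*(\\[.*?\\]);", Pattern.DOTALL);

    // Captures the string value of the hours_forever field.
    private static final Pattern HOURS =
            Pattern.compile("\"hours_forever\":\"([^\"]*)\"");

    /** Returns the raw JSON array text assigned to rgGames, if present in the HTML. */
    static Optional<String> extractRgGamesJson(String html) {
        Matcher m = RG_GAMES.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Returns hours_forever for the named game, or empty if the game or field is missing. */
    static Optional<String> hoursForever(String rgGamesJson, String gameName) {
        int nameAt = rgGamesJson.indexOf("\"name\":\"" + gameName + "\"");
        if (nameAt < 0) {
            return Optional.empty();
        }
        // Limit the search to this game's object (assumes each object starts with "appid").
        int nextGame = rgGamesJson.indexOf("{\"appid\"", nameAt);
        String entry = nextGame < 0
                ? rgGamesJson.substring(nameAt)
                : rgGamesJson.substring(nameAt, nextGame);
        Matcher m = HOURS.matcher(entry);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Shortened sample; with Jsoup you would pass document.html() instead.
        String html = "<script language=\"javascript\">\n"
                + "var rgGames = [{\"appid\":224260,\"name\":\"No More Room in Hell\","
                + "\"hours_forever\":\"515\",\"last_played\":1492042097}];\n"
                + "</script>";
        String json = extractRgGamesJson(html).orElse("[]");
        System.out.println(hoursForever(json, "No More Room in Hell").orElse("not found"));
    }
}
```

For anything beyond a quick check, handing the extracted array text to a real JSON parser would likely be more robust than the substring search used here, since it would handle escaping and field order properly.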
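Answer 1 (score: 1)
To execute JavaScript from Java code, there is Selenium:
Selenium-WebDriver makes direct calls to the browser using each browser's native support for automation.
To use it with Maven, add this dependency: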
<dependency>
<groupId>org.seleniumhq.selenium</groupId>
<artifactId>selenium-server</artifactId>
<version>3.4.0</version>
</dependency>
Next is a simple JUnit test that creates a WebDriver instance, navigates to the given URL, and executes a simple script to get rgGames.
You must download the chromedriver file from https://sites.google.com/a/chromium.org/chromedriver/downloads.
package SeleniumProject.selenium;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Map;

import org.junit.After;
import org.junit.AfterClass;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriverService;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;

@RunWith(JUnit4.class)
public class ChromeTest {
    private static ChromeDriverService service;
    private WebDriver driver;

    @BeforeClass
    public static void createAndStartService() throws IOException {
        // Point the service at the downloaded chromedriver executable.
        service = new ChromeDriverService.Builder()
                .usingDriverExecutable(new File("D:\\Downloads\\chromedriver_win32\\chromedriver.exe"))
                .withVerbose(false).usingAnyFreePort().build();
        // Let a failed start abort the test run instead of continuing without a service.
        service.start();
    }

    @AfterClass
    public static void stopService() {
        service.stop();
    }

    @Before
    public void createDriver() {
        ChromeOptions chromeOptions = new ChromeOptions();
        DesiredCapabilities capabilities = DesiredCapabilities.chrome();
        capabilities.setCapability(ChromeOptions.CAPABILITY, chromeOptions);
        driver = new RemoteWebDriver(service.getUrl(), capabilities);
    }

    @After
    public void quitDriver() {
        driver.quit();
    }

    @Test
    @SuppressWarnings({"unchecked", "rawtypes"})
    public void testJS() {
        JavascriptExecutor js = (JavascriptExecutor) driver;
        // Load a new web page in the current browser window.
        driver.get("http://steamcommunity.com/id/Winning117/games/?tab=all");
        // Execute JavaScript in the context of the currently selected frame or window.
        ArrayList<Map> list = (ArrayList<Map>) js.executeScript("return rgGames;");
        // Each Map holds the properties of one game.
        for (Map map : list) {
            // Compare the value of the "name" key to Cluckles' Adventure.
            if ("Cluckles' Adventure".equals(map.get("name"))) {
                // Print all properties of Cluckles' Adventure.
                map.forEach((key, value) -> System.out.println(key + " : " + value));
            }
        }
    }
}
As you can see, Selenium loads the page with driver.get("http://steamcommunity.com/id/Winning117/games/?tab=all");
To get the data for all of Winning117's games, the script returns the rgGames variable:
ArrayList<Map> list = (ArrayList<Map>) js.executeScript("return rgGames;");
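Once the list is available, reading a single field such as hours_forever is a plain collection lookup. The sketch below runs without a browser: the GameLookup class and the fieldOf method are hypothetical names, and the one-element list stands in for what executeScript is expected to return (Selenium documents that JavaScript arrays come back as a List and JavaScript objects as a Map<String, Object>). The sample values are taken from the rgGames snippet quoted in the first answer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

// Hypothetical helper for illustration; not part of Selenium.
final class GameLookup {

    /** Returns the given field of the first game with a matching "name" that has that field. */
    static Optional<Object> fieldOf(List<Map<String, Object>> games, String gameName, String field) {
        return games.stream()
                .filter(game -> gameName.equals(game.get("name")))
                .map(game -> game.get(field))
                .filter(Objects::nonNull) // a game may have no such field
                .findFirst();
    }

    public static void main(String[] args) {
        // Stand-in for the result of js.executeScript("return rgGames;").
        Map<String, Object> game = new HashMap<>();
        game.put("appid", 224260L);
        game.put("name", "No More Room in Hell");
        game.put("hours_forever", "515");
        List<Map<String, Object>> games = Collections.singletonList(game);

        System.out.println(fieldOf(games, "No More Room in Hell", "hours_forever").orElse("not found"));
    }
}
```

In the test above, the same method could be called with the list returned by executeScript after an unchecked cast to List<Map<String, Object>>, passing "Cluckles' Adventure" and "hours_forever".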
Answer 2 (score: 0)
Try this:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TestScrape {
    public static void main(String[] args) throws Exception {
        String url = "http://steamcommunity.com/id/Winning117/games/?tab=all";
        Document document = Jsoup.connect(url).get();
        // select() returns Elements (a list of matches), not a single Element
        Elements playTime = document.select("div#game_605250");
        Elements val = playTime.select(".hours_played");
        System.out.println(val.text());
    }
}