Jsoup ID selection not working

Time: 2017-05-02 06:31:24

Tags: java select jsoup

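I'm trying to get data from a web page (http://steamcommunity.com/id/Winning117/games/?tab=all) using a specific tag, but I keep getting null. The result I want is the "hours played" for a specific game, in this case Cluckles' Adventure. Any help is appreciated, thanks :)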

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TestScrape {
    public static void main(String[] args) throws Exception {
        String url = "http://steamcommunity.com/id/Winning117/games/?tab=all";
        Document document = Jsoup.connect(url).get();

        Element playTime = document.select("div#game_605250").first();
        System.out.println(playTime);
    }
}

Edit: How can I tell whether a web page uses JavaScript and therefore cannot be parsed by Jsoup?

3 answers:

Answer 0: (score: 1)

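The page you want to scrape is populated by JavaScript, so without running JS there is no #game_605250 element to select. All of the data is written into the page by JavaScript.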
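One rough way to check this for any page is to look at the raw HTML the server returns (what Jsoup receives) and see whether the id you are selecting appears in it at all. The sketch below is only a heuristic under that assumption; the class name `StaticIdCheck`, the method `hasIdInStaticHtml`, and the sample strings are made up for illustration and are not part of Jsoup.

```java
import java.util.regex.Pattern;

class StaticIdCheck {
    /**
     * Returns true if the raw (unrendered) HTML contains an attribute
     * id="..." or id='...' with exactly the given value.
     * Heuristic only: it does not parse the HTML.
     */
    static boolean hasIdInStaticHtml(String rawHtml, String id) {
        Pattern p = Pattern.compile("\\bid\\s*=\\s*[\"']" + Pattern.quote(id) + "[\"']");
        return p.matcher(rawHtml).find();
    }

    public static void main(String[] args) {
        // Hypothetical samples: markup as a browser would render it vs. data only inside a script.
        String rendered = "<div id=\"game_605250\"><h5 class=\"hours_played\">...</h5></div>";
        String scriptOnly = "<script>var rgGames = [{\"appid\":605250}];</script>";

        System.out.println(hasIdInStaticHtml(rendered, "game_605250"));   // true
        System.out.println(hasIdInStaticHtml(scriptOnly, "game_605250")); // false
    }
}
```

If the id shows up in the browser's element inspector but not in the raw response, the element is most likely created by JavaScript and a selector in Jsoup will not find it.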

However, when I print the document to a file, I see data like this:

<script language="javascript">
        var rgGames = [{"appid":224260,"name":"No More Room in Hell","logo":"http:\/\/cdn.steamstatic.com.8686c.com\/steamcommunity\/public\/images\/apps\/224260\/670e9aba35dc53a6eb2bc686d302d357a4939489.jpg","friendlyURL":224260,"availStatLinks":{"achievements":true,"global_achievements":true,"stats":false,"leaderboards":false,"global_leaderboards":false},"hours_forever":"515","last_played":1492042097},{"appid":241540,"name":"State of Decay","logo":"http:\/\/....

You can then extract 'rgGames' with some string tools and parse it into a JSON object.

It's not a pretty approach, but it works.
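A minimal sketch of that string-based extraction, using only the JDK, might look like the following. It assumes the page contains a `var rgGames = [...]` assignment shaped like the snippet above, with `appid` as the first key of each entry and no extra whitespace. The class and method names are hypothetical, and a real JSON library would be more robust than the regex lookup used here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RgGamesExtractor {
    /** Returns the array literal assigned to rgGames, or null if it is not found. */
    static String extractRgGames(String html) {
        int marker = html.indexOf("var rgGames");
        if (marker < 0) return null;
        int start = html.indexOf('[', marker);
        if (start < 0) return null;

        int depth = 0;
        boolean inString = false;
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inString) {
                if (c == '\\') i++;                 // skip the escaped character
                else if (c == '"') inString = false;
            } else if (c == '"') {
                inString = true;
            } else if (c == '[') {
                depth++;
            } else if (c == ']' && --depth == 0) {
                return html.substring(start, i + 1); // matching close bracket
            }
        }
        return null; // brackets never balanced
    }

    /** Returns the hours_forever value for the given appid, or null if absent. */
    static String hoursForever(String rgGamesJson, int appid) {
        String key = "{\"appid\":" + appid + ",";
        int from = rgGamesJson.indexOf(key);
        if (from < 0) return null;
        // Limit the search to this entry so a later game's value is not picked up.
        int next = rgGamesJson.indexOf("{\"appid\":", from + key.length());
        String entry = next < 0 ? rgGamesJson.substring(from) : rgGamesJson.substring(from, next);
        Matcher m = Pattern.compile("\"hours_forever\":\"([^\"]*)\"").matcher(entry);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened sample based on the script block shown above.
        String html = "<script language=\"javascript\">\n"
                + "var rgGames = [{\"appid\":224260,\"name\":\"No More Room in Hell\","
                + "\"availStatLinks\":{\"achievements\":true},\"hours_forever\":\"515\"},"
                + "{\"appid\":241540,\"name\":\"State of Decay\"}];\n</script>";

        String json = extractRgGames(html);
        System.out.println(hoursForever(json, 224260)); // 515
        System.out.println(hoursForever(json, 241540)); // null
    }
}
```

In practice the `html` string would come from the fetched document, for example `document.html()` in Jsoup.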

Answer 1: (score: 1)

To execute JavaScript from Java code, there is Selenium:

  

Selenium-WebDriver makes direct calls to the browser using each browser's native support for automation.

To use it with Maven, add this dependency:

<dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-server</artifactId>
            <version>3.4.0</version>
        </dependency>

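Next, here is a simple JUnit test that creates a WebDriver instance, navigates to the given URL, and executes a simple script to get rgGames. You must download the chromedriver file from https://sites.google.com/a/chromium.org/chromedriver/downloads.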

package SeleniumProject.selenium;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Map;

import org.junit.After;
import org.junit.AfterClass;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.junit.runners.JUnit4;
import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriverService;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.remote.RemoteWebDriver;
import org.openqa.selenium.support.ui.ExpectedCondition;
import org.openqa.selenium.support.ui.WebDriverWait;

import junit.framework.TestCase;

@RunWith(JUnit4.class)
public class ChromeTest extends TestCase {

    private static ChromeDriverService service;
    private WebDriver driver;

    @BeforeClass
    public static void createAndStartService() {
        service = new ChromeDriverService.Builder()
                .usingDriverExecutable(new File("D:\\Downloads\\chromedriver_win32\\chromedriver.exe"))
                .withVerbose(false).usingAnyFreePort().build();
        try {
            service.start();
        } catch (IOException e) {
            System.out.println("service didn't start");
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

    @AfterClass
    public static void createAndStopService() {
        service.stop();
    }

    @Before
    public void createDriver() {
        ChromeOptions chromeOptions = new ChromeOptions();
        DesiredCapabilities capabilities = DesiredCapabilities.chrome();
        capabilities.setCapability(ChromeOptions.CAPABILITY, chromeOptions);
        driver = new RemoteWebDriver(service.getUrl(), capabilities);
    }

    @After
    public void quitDriver() {
        driver.quit();
    }

    @Test
    public void testJS() {
        JavascriptExecutor js = (JavascriptExecutor) driver;

        // Load a new web page in the current browser window.
        driver.get("http://steamcommunity.com/id/Winning117/games/?tab=all");

        // Executes JavaScript in the context of the currently selected frame or
        // window.
        ArrayList<Map> list = (ArrayList<Map>) js.executeScript("return rgGames;");
        // Map represent properties for one game
        for (Map map : list) {
            for (Object key : map.keySet()) {
                // check each key to find "name" and compare its value to
                // Cluckles' Adventure
                if (key instanceof String && key.equals("name") && map.get(key).equals("Cluckles' Adventure")) {
                    // print all properties for game Cluckles' Adventure
                    map.forEach((key1, value) -> {
                        System.out.println(key1 + " : " + value);
                    });
                }
            }
        }
    }
}

As you can see, Selenium loads the page with:

driver.get("http://steamcommunity.com/id/Winning117/games/?tab=all");

To get the data for all of Winning117's games, it returns the rgGames variable:

ArrayList<Map> list = (ArrayList<Map>) js.executeScript("return rgGames;");
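Once the list is available, the lookup does not need to iterate over every key of every map. A possible simplification, assuming each entry carries "name" and "hours_forever" keys as in the rgGames snippet quoted in the first answer, is sketched below. `GameLookup` and its method are hypothetical names, and the sample entry reuses values from that snippet rather than real Selenium output.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GameLookup {
    /** Returns the hours_forever value of the first game with the given name, or null. */
    static Object hoursForever(List<Map<String, Object>> games, String name) {
        for (Map<String, Object> game : games) {
            if (name.equals(game.get("name"))) {
                return game.get("hours_forever");
            }
        }
        return null; // no game with that name
    }

    public static void main(String[] args) {
        // Stand-in for the list returned by js.executeScript("return rgGames;").
        Map<String, Object> game = new HashMap<>();
        game.put("appid", 224260L);
        game.put("name", "No More Room in Hell");
        game.put("hours_forever", "515");

        List<Map<String, Object>> games = new ArrayList<>();
        games.add(game);

        System.out.println(hoursForever(games, "No More Room in Hell")); // 515
        System.out.println(hoursForever(games, "Cluckles' Adventure"));  // null
    }
}
```

A game that has never been played may have no "hours_forever" key at all, so callers should be ready for a null result.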

Answer 2: (score: 0)

Try this:

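import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class TestScrape {
    public static void main(String[] args) throws Exception {
        String url = "http://steamcommunity.com/id/Winning117/games/?tab=all";
        Document document = Jsoup.connect(url).get();

        // select() returns Elements (possibly empty, never null), not a single Element
        Elements playTime = document.select("div#game_605250");
        Elements val = playTime.select(".hours_played");
        // Prints an empty string if nothing matched
        System.out.println(val.text());
    }
}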