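Jsoup returning a live text field

Date: 2015-10-01 00:02:26

Tags: java html web-scraping jsoup

This looks like it should be simple, but I cannot retrieve the text on this web page, and the text appears to be changing.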

package WorldBoss;


import org.jsoup.nodes.Document;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.net.MalformedURLException;

public class WorldBoss {

    public static void main(String [] args) throws MalformedURLException {
        Document page = null;
        try {
            page = Jsoup.connect("http://wiki.guildwars2.com/wiki/World_boss").get();
        } catch (IOException e) {
            e.printStackTrace();
        }
        Elements allTimers = page.getElementsByClass("timerjs");
        String firstTime = allTimers.first().html();
        System.out.println(firstTime);
    }
}

The text changes because it is a countdown timer.

In the element's properties on the page, innerHTML shows the correct value.


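Does anyone know how to get this information with Jsoup?

If you want to take a look, the page is http://wiki.guildwars2.com/wiki/World_boss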

1 answer:

Answer 0: (score: 0)

As Pshemo mentioned in the comments, Jsoup is an HTML parser, so it neither renders the page nor executes scripts.
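To make that concrete, here is a minimal sketch of what Jsoup sees when an element is filled in by a script. The markup is a hypothetical stand-in, not the actual source of the wiki page; only the class name `timerjs` is taken from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StaticParseDemo {

    // Hypothetical markup: the span stays empty until a browser runs the script.
    static final String HTML =
            "<html><body>"
            + "<span class=\"timerjs\"></span>"
            + "<script>document.querySelector('.timerjs').textContent = '00:01:33';</script>"
            + "</body></html>";

    // Parses the markup with Jsoup and returns the inner HTML of the first timer element.
    static String timerHtml(String html) {
        Document page = Jsoup.parse(html);
        return page.getElementsByClass("timerjs").first().html();
    }

    public static void main(String[] args) {
        // Jsoup only parses the markup; the script is never executed.
        String timer = timerHtml(HTML);
        System.out.println("timer is empty: " + timer.isEmpty()); // prints "timer is empty: true"
    }
}
```

The script is kept in the parsed document as an ordinary element, but nothing runs it, so the span has no content for Jsoup to return.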

To extract the field you want, I modified your code slightly to use the PhantomJS driver through Selenium. PhantomJS fetches and renders the page, and the rendered page source is then handed to Jsoup for parsing. The code is below:

import org.jsoup.nodes.Document;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.remote.DesiredCapabilities;

public class WorldBoss {

    public static void main(String[] args) {

        WebDriver driver = new PhantomJSDriver(new DesiredCapabilities());
        try {
            driver.get("http://wiki.guildwars2.com/wiki/World_boss"); // retrieve page

            // A fixed sleep is bad practice; the best practice is to wait for a
            // specific element on the page, e.g. the element you're looking for [1]
            try { // wait to ensure the page is loaded and the JavaScript has run
                Thread.sleep(3 * 1000);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }

            // hand the rendered source to Jsoup
            String pageSource = driver.getPageSource();
            Document page = Jsoup.parse(pageSource);
            Elements allTimers = page.getElementsByClass("timerjs");

            for (Element timer : allTimers) {
                // you can get whichever timer you want by its index
                String firstTime = timer.html().trim();
                if (firstTime.isEmpty()) continue; // skip timers that were not rendered
                // use the timer for whatever you want
                System.out.println(firstTime);
            }
        } finally {
            driver.quit(); // shut down the PhantomJS process
        }
    }
}

I used Maven, so the dependencies in the pom file are:

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.7.2</version>
    </dependency>
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>2.47.1</version>
    </dependency>
    <dependency>
        <groupId>com.github.detro.ghostdriver</groupId>
        <artifactId>phantomjsdriver</artifactId>
        <version>1.0.1</version>
    </dependency>

The code output is:

Active
00:01:33
00:01:33
00:16:33
00:31:33
00:46:33

If PhantomJS is not installed on your machine, you need to install it for this to work. On a Debian-based box, install the phantomjs package with the system package manager. For other platforms (or to build from source), see the PhantomJS installation instructions.
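Once the timer strings shown in the output are available, they can be turned into values you can compute with. The following is a rough sketch using only the JDK; it assumes the timers always have the `HH:mm:ss` shape seen in the output, and `CountdownParser` and `parseCountdown` are hypothetical names, not part of Jsoup or Selenium.

```java
import java.time.Duration;
import java.util.Optional;

public class CountdownParser {

    // Returns the countdown as a Duration, or empty for non-timer text such as "Active".
    // Assumption: timers look like HH:mm:ss, as in the sample output.
    static Optional<Duration> parseCountdown(String text) {
        String trimmed = text.trim();
        if (!trimmed.matches("\\d{1,2}:\\d{2}:\\d{2}")) {
            return Optional.empty();
        }
        String[] parts = trimmed.split(":");
        long hours = Long.parseLong(parts[0]);
        long minutes = Long.parseLong(parts[1]);
        long seconds = Long.parseLong(parts[2]);
        return Optional.of(Duration.ofHours(hours).plusMinutes(minutes).plusSeconds(seconds));
    }

    public static void main(String[] args) {
        String[] samples = {"Active", "00:01:33", "00:16:33"};
        for (String sample : samples) {
            Optional<Duration> countdown = parseCountdown(sample);
            if (countdown.isPresent()) {
                System.out.println(sample + " -> " + countdown.get().getSeconds() + " seconds");
            } else {
                System.out.println(sample + " -> not a countdown");
            }
        }
    }
}
```

Returning an `Optional` keeps entries like "Active" from throwing, so the loop over all timers does not need a try/catch.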

Hope this helps.

  1. How to wait for elements in selenium
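
As a sketch of what reference [1] suggests, the fixed sleep could be replaced by a wait that polls until at least one timer has text. This assumes the Selenium 2.x API from the pom above, where `WebDriverWait` takes a timeout in seconds (newer Selenium versions take a `Duration` instead). The class name `WorldBossWait`, the helper `isRendered`, and the 10 second timeout are illustrative choices, not part of the original answer.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.remote.DesiredCapabilities;
import org.openqa.selenium.support.ui.ExpectedCondition;
import org.openqa.selenium.support.ui.WebDriverWait;

public class WorldBossWait {

    // A timer counts as rendered once its text is non-blank.
    static boolean isRendered(String text) {
        return text != null && !text.trim().isEmpty();
    }

    public static void main(String[] args) {
        WebDriver driver = new PhantomJSDriver(new DesiredCapabilities());
        try {
            driver.get("http://wiki.guildwars2.com/wiki/World_boss");

            // Poll for up to 10 seconds until some timer element has text.
            new WebDriverWait(driver, 10).until(new ExpectedCondition<Boolean>() {
                @Override
                public Boolean apply(WebDriver d) {
                    List<WebElement> timers = d.findElements(By.className("timerjs"));
                    for (WebElement timer : timers) {
                        if (isRendered(timer.getText())) {
                            return true;
                        }
                    }
                    return false;
                }
            });

            // The scripts have run, so the rendered source can go to Jsoup.
            Document page = Jsoup.parse(driver.getPageSource());
            System.out.println(page.getElementsByClass("timerjs").size() + " timer elements found");
        } finally {
            driver.quit();
        }
    }
}
```

Waiting on the text rather than on the mere presence of the element matters if the timer elements already exist in the static HTML and are only filled in by the script; whether that is the case for this page is an assumption.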