Question

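I am trying to fetch the HTML source of the Zhaopin login page using HttpResponse (with HttpClient), Jsoup, and HtmlUnit (the one I tried first), but I have not succeeded. All three approaches return obfuscated HTML source, even though I tried sending all the request headers with each of them.

So I tried PhantomJS, since I need to wait for the page's JavaScript to execute, but that did not work either.

Has anyone used it for this?

Here is the method I am using: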

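    import java.util.concurrent.TimeUnit;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.phantomjs.PhantomJSDriver;

    public static Document renderPage(String url) {
        // Tell the driver where the PhantomJS binary lives.
        System.setProperty("phantomjs.binary.path", "/usr/local/share/phantomjs-1.9.8-linux-x86_64/bin/phantomjs");
        WebDriver ghostDriver = new PhantomJSDriver();
        try {
            // A negative timeout lets scripts and page loads run indefinitely.
            ghostDriver.manage().timeouts().setScriptTimeout(-1, TimeUnit.DAYS);
            ghostDriver.manage().timeouts().pageLoadTimeout(-1, TimeUnit.DAYS);
            ghostDriver.get(url);
            // Hand the page source reported by the driver to Jsoup.
            return Jsoup.parse(ghostDriver.getPageSource());
        } finally {
            // Always shut down the PhantomJS process.
            ghostDriver.quit();
        }
    }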

Thanks!

Answer 1

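This produces the page source (at least here, using the latest HtmlUnit SNAPSHOT). The resulting markup still contains a lot of JavaScript, but it should be easy to strip that out.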

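    // Package names below are for HtmlUnit 2.x (com.gargoylesoftware.htmlunit).
    import com.gargoylesoftware.htmlunit.BrowserVersion;
    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    try (WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
        final HtmlPage page = webClient.getPage("https://passport.zhaopin.com/org/login");
        // Give background JavaScript up to 10 seconds to finish.
        webClient.waitForBackgroundJavaScript(10000);

        System.out.println(page.asXml());
    }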
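
As for stripping the remaining JavaScript out of the markup, one possible approach is to pass the XML string to Jsoup (which the question already uses) and drop every `script` element. This is only a sketch: the class name `ScriptStripper` and the method `stripScripts` are hypothetical names chosen for illustration, and the sample HTML in `main` is made up rather than taken from the real page.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ScriptStripper {

    // Hypothetical helper: parse the HTML and remove all <script> elements.
    public static Document stripScripts(String html) {
        Document doc = Jsoup.parse(html);
        // select() returns every matching element; remove() detaches them from the DOM.
        doc.select("script").remove();
        return doc;
    }

    public static void main(String[] args) {
        // Made-up sample input standing in for the output of page.asXml().
        String html = "<html><head><script>var a = 1;</script></head>"
                + "<body><p>Login</p><script src=\"x.js\"></script></body></html>";
        Document cleaned = stripScripts(html);
        System.out.println(cleaned.outerHtml());
    }
}
```

With the HtmlUnit snippet above, the call would presumably look like `ScriptStripper.stripScripts(page.asXml())`. Inline event-handler attributes such as `onclick` are not touched by this sketch and would need separate handling if they matter.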
