HtmlUnit: Loading elements on an AJAX page

Time: 2015-06-25 16:42:59

Tags: javascript java ajax web-scraping htmlunit

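I'm new to Java and HtmlUnit, and I'm trying to scrape news updates from a page that loads them through AJAX calls. No matter what I do, the updates never get loaded. What am I missing?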

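I have tried several ways of waiting for the JS scripts to finish, but to no avail. Clicking the button that loads more news, or firing its events directly, doesn't seem to help either.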
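One common way to wait for AJAX content is to poll: check whether the target elements exist and, if not, give background JavaScript a little more time before checking again. The following is only a minimal sketch of that loop using the standard library. The class and method names (`PollUntil`, `pollUntil`) are hypothetical; with HtmlUnit the condition could be something like `!page.getByXPath(...).isEmpty()` and the wait step `webClient.waitForBackgroundJavaScript(1000)`, but that wiring is an assumption and is not shown running here.

```java
import java.util.function.BooleanSupplier;

class PollUntil {
    /**
     * Checks the condition up to maxAttempts times, running waitStep after
     * each failed check, then checks once more. Returns true as soon as the
     * condition holds, false if it never did.
     */
    static boolean pollUntil(BooleanSupplier condition, Runnable waitStep, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (condition.getAsBoolean()) {
                return true;
            }
            // With HtmlUnit this could be (assumption):
            // webClient.waitForBackgroundJavaScript(1000);
            waitStep.run();
        }
        return condition.getAsBoolean();
    }

    public static void main(String[] args) {
        // Stand-in for a page whose content appears after three wait steps.
        int[] elapsedSteps = {0};
        boolean loaded = pollUntil(
                () -> elapsedSteps[0] >= 3,   // stand-in for "news items are present"
                () -> elapsedSteps[0]++,      // stand-in for "wait for background JS"
                10);
        System.out.println("loaded=" + loaded + " after " + elapsedSteps[0] + " steps");
    }
}
```

If the condition never becomes true within the attempt budget, that suggests the scripts are failing rather than merely being slow, which is worth distinguishing before adding longer timeouts.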

I have been assuming that I don't need to reassign my page instance after the JS scripts have finished. Is that correct?

I have also read that HtmlUnit's JS engine doesn't work well on some websites. Is this one of those cases, or am I just missing something?

Thanks for your help!

Here is my code:

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlButton;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlInput;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.IOException;
import java.util.List;
import org.junit.Assert;

public class ProblemDemo {
    public static void main(String[] args) throws IOException, InterruptedException {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_38);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setTimeout(10000);
        webClient.setJavaScriptTimeout(10000);
        webClient.getOptions().setJavaScriptEnabled(true);

        // Login procedure
        HtmlPage page = webClient.getPage("https://login.xing.com/login");

        final HtmlForm form = (HtmlForm) page.getElementById("login-form");
        final HtmlInput userID = form.getInputByName("login_form[username]");
        final HtmlInput password = form.getInputByName("login_form[password]");
        final HtmlButton submit = form.getButtonByName("button");
        final HtmlInput remember = form.getInputByName("login_form[perm]");

        userID.setValueAttribute("user");
        password.setValueAttribute("pass");
        remember.setChecked(true);
        page = submit.click();

        Assert.assertEquals("Start | XING", page.getTitleText());

        //Navigate to page to be scraped
        page = webClient.getPage(
                "https://www.xing.com/companies/deutschepostag/updates");
        webClient.waitForBackgroundJavaScript(10*1000);
        System.out.println(page.getUrl().toString());
        System.out.println(page.asXml());

        //Print number of employees (works, not dynamic)
        HtmlElement result = page.getFirstByXPath("//div[@id='profile-nav-tabs']"
                + "/ul/li[@id='employees-tab']/a");
        System.out.println("Employees: " + result.getTextContent());

        //Print news (doesn't work)
        List<HtmlElement> results = (List<HtmlElement>) page.getByXPath("//div"
                + "[@id='company-updates']/ul[@id='news-feed']/li/div"
                + "[@class='activity-content']");
        System.out.println("News found: " + results.size());
        for(HtmlElement item : results){
            System.out.println("            NEW ITEM");
            System.out.println(item.getTextContent());
        }
    }
}

Also, is the following warning relevant? HtmlUnit produces a lot of JS warnings, and I'm not sure which ones matter and which don't.

WARNING: Obsolete content type encountered: 'text/javascript'.
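If the sheer volume of warnings is what makes the output hard to read, one option is to raise the log threshold for HtmlUnit's package so that only severe messages remain. The sketch below assumes that HtmlUnit's logging ends up in `java.util.logging`, which is typically the case when no other logging backend is on the classpath; the class name `QuietHtmlUnitLogging` is hypothetical. It does not tell you whether a given warning matters, it only reduces the noise.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class QuietHtmlUnitLogging {
    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so a configured logger could otherwise be garbage-collected.
    static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware.htmlunit");

    /** Hides WARNING and below for HtmlUnit's package, keeping SEVERE messages. */
    static void showOnlySevere() {
        HTMLUNIT_LOGGER.setLevel(Level.SEVERE);
    }

    public static void main(String[] args) {
        showOnlySevere();
        System.out.println("warnings shown: " + HTMLUNIT_LOGGER.isLoggable(Level.WARNING));
        System.out.println("severe shown: " + HTMLUNIT_LOGGER.isLoggable(Level.SEVERE));
    }
}
```

Call `showOnlySevere()` before creating the `WebClient`. While debugging a page that fails to load, it may be better to leave warnings visible at first and silence them only once the scraper works.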

1 Answer:

Answer 0: (score: 0)

Setting setThrowExceptionOnScriptError to false prevents you from seeing the script errors.

Edit: The latest snapshot includes a fix for performance.navigation.redirectCount.

Please try it and report back.