Cannot access the new page after click() on a submit button in HtmlUnit

Time: 2016-12-16 12:01:08

Tags: java web-scraping htmlunit

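The problem is this: when I run the code below, it runs up to submitButton.fireEvent("onclick").getNewPage() and then seems to finish, even though the final System.out.println(pageAfterLogin.getUrl().toString()) is never executed. No error is reported when the program runs.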

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlInput;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.util.List;

public class WebScraperHTMLUnit2 {

    public static void main(String[] args) {
        try {
            WebClient wc = new WebClient();
            HtmlPage page = wc.getPage("https://www.google.com/");

            HtmlInput searchForm = (HtmlInput) page.getFirstByXPath("//input[@name='q']");
            searchForm.setValueAttribute("q");

            HtmlElement submitButton = page.getFirstByXPath("//button[@id='searchButton']");
            HtmlPage pageAfterLogin = (HtmlPage) submitButton.fireEvent("onclick").getNewPage();

            System.out.println(pageAfterLogin.getUrl().toString());

        } catch (Exception ex) {}
    }
}

Here is the NetBeans output log:

run:
дек 16, 2016 2:38:16 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNING: CSS error: 'https://www.google.ru/' [1:14018] Error in expression. (Invalid token " ". Was expecting one of: <NUMBER>, "inherit", <IDENT>, <STRING>, <HASH>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <RESOLUTION_DPI>, <RESOLUTION_DPCM>, <PERCENTAGE>, <DIMENSION>, <UNICODE_RANGE>, <URI>, <FUNCTION>, "progid:".)
дек 16, 2016 2:38:16 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNING: CSS error: 'https://www.google.ru/' [1:14042] Error in expression. (Invalid token " ". Was expecting one of: <NUMBER>, "inherit", <IDENT>, <STRING>, <HASH>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <RESOLUTION_DPI>, <RESOLUTION_DPCM>, <PERCENTAGE>, <DIMENSION>, <UNICODE_RANGE>, <URI>, <FUNCTION>, "progid:".)
дек 16, 2016 2:38:16 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'text/javascript'.
СБОРКА УСПЕШНО ЗАВЕРШЕНА (общее время: 3 секунды)

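1 answer:

Answer 0 (score: 1)

The XPath for the button is incorrect. It matches nothing on the page, so getFirstByXPath returns null, the next call throws a NullPointerException, and the empty catch block swallows it silently. The button is actually: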

<input value="Google Search" aria-label="Google Search" name="btnK" type="submit" jsaction="sf.chk">
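To see why the original expression finds nothing, the two XPath expressions can be checked against that markup using only the JDK's javax.xml.xpath API. This is a minimal sketch rather than HtmlUnit code: the class name XPathCheck and the wrapping form element are assumptions made for illustration, and the input tag is self-closed so that the XML parser accepts it.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

final class XPathCheck {

    // The button markup quoted above, wrapped in a <form> and self-closed
    // (both are assumptions of this sketch, needed for well-formed XML).
    static final String MARKUP =
        "<form><input value=\"Google Search\" aria-label=\"Google Search\" "
        + "name=\"btnK\" type=\"submit\" jsaction=\"sf.chk\"/></form>";

    // Returns true when the expression selects at least one node in MARKUP.
    static boolean matches(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            Node node = (Node) xpath.evaluate(
                expression,
                new InputSource(new StringReader(MARKUP)),
                XPathConstants.NODE);
            return node != null;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // The expression from the question: there is no <button> element at all.
        System.out.println(matches("//button[@id='searchButton']")); // prints false
        // The expression from the answer: selects the submit input by name.
        System.out.println(matches("//input[@name='btnK']"));        // prints true
    }
}
```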

Your code should be:

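// Also requires: import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;
try {
    final WebClient wc = new WebClient();
    // Do not abort when a script on the page throws an error.
    wc.getOptions().setThrowExceptionOnScriptError(false);

    HtmlPage page = wc.getPage("https://www.google.com/");

    HtmlInput searchForm = page.getFirstByXPath("//input[@name='q']");
    searchForm.setValueAttribute("q");

    HtmlSubmitInput submitButton = page.getFirstByXPath("//input[@name='btnK']");
    HtmlPage pageAfterLogin = submitButton.click();

    System.out.println(pageAfterLogin.getUrl().toString());

} catch (Exception e) {
    // Print the failure instead of swallowing it, so a null lookup is visible.
    e.printStackTrace();
}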

You need to set setThrowExceptionOnScriptError to false because an error is thrown (I am not sure why), and you do not want it to stop your code from executing.

According to this post, the HTML generated on www.google.com changes constantly, so my //input[@name='btnK'] XPath may stop working in the future.
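If the markup does change, the lookup will return nothing again. HtmlUnit's getFirstByXPath returns null in that case, so one option is to fail with a clear message instead of running into a NullPointerException later. Below is a rough sketch of such a guard in plain Java; requireFound and the Lookup class are hypothetical names, not part of HtmlUnit, and the null argument in main stands in for a lookup that matched nothing.

```java
final class Lookup {

    // Hypothetical helper: turns a null lookup result into a descriptive exception.
    static <T> T requireFound(T node, String xpath) {
        if (node == null) {
            throw new IllegalStateException("No element matches XPath: " + xpath);
        }
        return node;
    }

    public static void main(String[] args) {
        try {
            // Stand-in for a getFirstByXPath call that found no element.
            Object submitButton = requireFound(null, "//input[@name='btnK']");
            System.out.println("Found: " + submitButton);
        } catch (Exception e) {
            // Report the failure rather than hiding it in an empty catch block.
            System.err.println("Lookup failed: " + e.getMessage());
        }
    }
}
```

In the scraper itself, the same call would wrap the result of page.getFirstByXPath(...), so that a stale expression is reported by name as soon as it stops matching.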