HtmlUnit fails on an href

Time: 2016-05-01 18:07:24

Tags: onclick href htmlunit mojarra

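I am currently writing an HtmlUnit-based web scraper to harvest company names and details from the Hannover Messe website. I seem to have hit a roadblock: I cannot get the page-forward button on the search results page to work.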

The entry page is
www.hannovermesse.de/en/exhibition/exhibitors-products/advanced-search/
where I then set some search filters via checkboxes (EU Region, Industrial Automation / Robotics).

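After submitting the form and loading the search results I get about 400 hits, and when I select the Exhibitors tab I receive the first results page. The search results are displayed at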
//www.hannovermesse.de/en/exhibition/exhibitors-products/search
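Note: you need to run the whole sequence to reach the results screen! The site appears to use session/cookie data to decide what to display, and by default it shows nothing.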

This gives me 20 hits on the first page, with a page selector at the bottom of the page that has page 1 selected:
“[<] [1] 2 ... | n [>]”
To harvest all contacts I need to click through every screen of the search results.

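So my idea was to use the right-hand button to step through the pages, collecting the company details on each page as I go, and to terminate the loop when the right-hand button is no longer active. I located the right-hand button using various approaches such as getXPath and verified it, and I even modified it by adding a name attribute so that I could find it with the usual HtmlAnchor lookup methods.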

The result is always a runtime error followed by an abort.

The log messages are:

  

Mai 01, 2016 6:05:11 PM com.gargoylesoftware.htmlunit.html.HtmlScript isExecutionNeeded
WARNING: Script is not JavaScript (type: text/html, language: ). Skipping execution.
Mai 01, 2016 6:05:12 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SEVERE: runtimeError: message=[An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: :x).] sourceName=[http://www.hannovermesse.de/files/001-fs5/media/layout/js/dmag.min.js] line=[2] lineSource=[null] lineOffset=[0]
Mai 01, 2016 6:05:12 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SEVERE: runtimeError: message=[An invalid or illegal selector was specified (selector: '[id='sizzle-1462118712173']:checked' error: Invalid selector: [id="sizzle-1462118712173"]:checked).] sourceName=[http://www.hannovermesse.de/files/001-fs5/media/layout/js/dmag.min.js] line=[2] lineSource=[null] lineOffset=[0]
Mai 01, 2016 6:05:12 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'text/javascript'.
Mai 01, 2016 6:05:17 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.

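I tried various browser option settings, but no joy. I found that this "An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: :x)." error sometimes also shows up with spiders and with another test browser. There, a "webClient().waitForBackgroundJavaScriptStartingBefore(5000);" call solved the problem. I tried it, but it did not work for me.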

I attach my quick-and-dirty proof-of-concept Java program for your reference. I am using Eclipse Mars with Java JRE 1.8, JUnit 4 and the HtmlUnit 2.22 library.

Does anyone know what is going on, or what to change to make it work? I can't believe I am the first to stumble over this!

My Java code:

/*---------------------------------------------------------------------------------*/
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebClientOptions;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlCheckBoxInput;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlOption;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSelect;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;

public class App {
    static WebClient webClient;

    static String[] countries = {
                "European Union"
                                    };

    static String[] categories = {
                "Robotics"          
    };

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws Exception {

        setUp();

        HtmlPage currentPage = webClient.getPage("http://www.hannovermesse.de/en/exhibition/exhibitors-products/advanced-search/");
        System.out.println(currentPage.getTitleText()+"Web page open\n------------------------------------------------------------------\n");

        registerCountries(currentPage);
        registerCategories(currentPage);
        System.out.println("Search filters registered\n------------------------------------------------------------------\n");

        currentPage = submitSearchRequest(currentPage);
        System.out.println("Search filters submitted and results loaded\n------------------------------------------------------------------\n");

        selectExhibitorView(currentPage);
        System.out.println("Exhibitor View selected\n------------------------------------------------------------------\n");

        showCriteria(currentPage);

        showResultsCount(currentPage);

        HtmlPage backupPage = currentPage;

        for(int n=0, tn=0; n<1; n++){
            System.out.println("========================================================================================");
            // Parenthesize so the page number is added, not concatenated as "01".
            System.out.println(" Results page "+(n+1));

            HtmlAnchor nextPageButton = (HtmlAnchor) currentPage.getFirstByXPath(".//div[@class=\"col s-col12 m-col12 l-col12\"]/ul/following-sibling::a");
            String classValue = nextPageButton.getAttribute("class");
            nextPageButton.setAttribute("name", nextPageButton.getAttribute("class").trim());

            NamedNodeMap attribList = nextPageButton.getAttributes();
            for (int i=0; i < attribList.getLength(); i++) {
                Node node = attribList.item(i);
                String key=node.getLocalName();
                String val=node.getNodeValue();             
                System.out.printf("[%-15s] : '%s'\n", key, val);
            }

            List <HtmlElement> elementList = (List<HtmlElement>)currentPage.getByXPath(".//h4[@itemprop=\"name\"]/text()");         
            int i=0;
            for(; i<elementList.size();i++){
                System.out.printf("[%3d] '%s'\n", +(tn+i), elementList.get(i));
            }
            tn+=i; // accumulate the running hit count across pages

            System.out.println("Next Button :");
            final HtmlAnchor newPageLink  = (HtmlAnchor) currentPage.getAnchorByName(classValue.trim());
            currentPage = (HtmlPage) newPageLink.click();
            currentPage = nextPageButton.click();
            System.out.println("===========>[13]");

        }
        currentPage = backupPage;          

        System.out.println("Done");
        webClient.close();
    }

    private static void showResultsCount(HtmlPage currentPage) {
        String results = "";
        int count;
        results = (String) currentPage.getByXPath("String("+".//div[@class=\"col l-col8 m-col7 s-col12\"]/p[@class=\"query-text\"]/text()"+")").get(0);
        publish("Raw results : "+results);
        count= Integer.parseInt(results.split(" ")[0]);
        publish("Results : "+count+" found.\n");
    }

    private static void selectExhibitorView(HtmlPage currentPage) {
        HtmlSelect select = (HtmlSelect) currentPage.getElementById("searchResult:resultType");
        HtmlOption option = select.getOptionByValue("1");
        select.setSelectedAttribute(option, true);      
    }

    private static HtmlPage submitSearchRequest(HtmlPage currentPage) {
        try {       
            final HtmlForm form  = (HtmlForm) currentPage.getFormByName("searchAP:search");
            final HtmlSubmitInput button = form.getInputByName("searchAP:searchButton2");           
            currentPage = (HtmlPage) button.click();
            System.out.println(currentPage.getTitleText());
        } catch (Exception e) {
            System.out.println("===> Cannot submit Search Form, no submit button found!");
        }
        return currentPage; 
    }

    private static void showCriteria(HtmlPage currentPage) {
        publish("Filtercriteria for this search:");
        String results = "";
        results = (String) currentPage.getByXPath("String(.//h1[contains(text(), \"Search Result\")]/following-sibling::p)").get(0);

        String[] criteria= results.split(",");
        String key = "";
        Map<String, ArrayList<String>> cMap = new LinkedHashMap<String, ArrayList<String>>();
        ArrayList<String> value = new ArrayList<String>();
        cMap.put(key, value);

        for(int i=0; i<criteria.length; i++){
            if(criteria[i].contains(":")){
                String workCopy = new String(criteria[i]);
                String[] bits= workCopy.split(":");
                key = bits[0].trim();
                criteria[i]=bits[1].trim();
                value = new ArrayList<String>();
                cMap.put(key, value);
            }
            value.add(criteria[i].trim());
        }  

        for (Map.Entry<String, ArrayList<String>> entry : cMap.entrySet()) {
            key = entry.getKey();
            value = entry.getValue();
            if(!value.isEmpty()){
                System.out.println(key+": ");
                for (int i = 0; i < value.size(); i++) {
                    System.out.println("  "+value.get(i));
                }
            }
        }
    }

    public static void publish(String text) {
        System.out.println(text);       
    }
    public static void registerCountries(HtmlPage currentPage) {
        for(int i=0;i < countries.length; i++){
            setCountryCheckbox(currentPage, countries[i]);
        }
    }

    public static void registerCategories(HtmlPage currentPage) {
        for(int i=0;i < categories.length; i++){
            setCategoryCheckbox(currentPage, categories[i]);
        }       
    }

    public static void setCountryCheckbox(HtmlPage currentPage, String text) {
        String label="";
        HtmlCheckBoxInput input;

        try {
            label = (String) currentPage.getByXPath("String(.//label[contains(text(), \""+text+"\")]/@for)").get(0);
            System.out.print(text);
            input = currentPage.getHtmlElementById(label);
            input.setChecked(true);
            System.out.println(": "+(input.isChecked()?"SET":""));
        } catch (Exception e) {
            System.out.println("\rError: Label ID for '"+text+"' not found. ");
        }
    }

    public static void setCategoryCheckbox(HtmlPage currentPage, String text) {
        String label="";
        HtmlCheckBoxInput input;
        String XPathXpression = ".//strong[contains(text(), \""+text+"\")]/parent::div/input/@id";

        try {
            label = (String) currentPage.getByXPath("String("+XPathXpression+")").get(0);
            System.out.print(text+" : "+"'"+label+"' ");
            input = currentPage.getHtmlElementById(label);
            input.setChecked(true);
            System.out.println(": "+(input.isChecked()?"SET":""));
        } catch (Exception e) {
            System.out.println("\rError: Label ID for '"+text+"' not found. ");
        }
    }

    public static void setUp() throws InterruptedException {
          webClient = new WebClient(BrowserVersion.FIREFOX_45);
          WebClientOptions options = webClient.getOptions();
          options.setPrintContentOnFailingStatusCode(true);
          options.setJavaScriptEnabled(true);
          options.setThrowExceptionOnScriptError(false);
          options.setThrowExceptionOnFailingStatusCode(false);
          webClient.waitForBackgroundJavaScriptStartingBefore(5000);          
      }
}

2 Answers:

Answer 0: (score: 0)

If you use HtmlSubmitInput for the button, HtmlUnit looks for an input element rather than a button element.

Use HtmlButton instead of HtmlSubmitInput.

Here is an example.

  

HtmlButton button = form.getButtonByName("submitButton");

Answer 1: (score: 0)

Just two hints:

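  1. "An invalid or illegal selector was specified ...": this is very common output when using HtmlUnit to test web applications that use jQuery. It means that jQuery makes some calls to check which CSS selector features the browser supports. Because HtmlUnit logs the exception when it is constructed, you see this log output. The exception is handled later by the (jQuery) code. Usually you can ignore it.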

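  2. webClient.waitForBackgroundJavaScriptStartingBefore(5000); is not an option. This call does not set any wait timeout. Usually you have to place this call in your normal application flow, after triggering some action. You may need it if you trigger an Ajax operation.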
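
To illustrate the second hint together with the paging loop from the question, here is a minimal sketch of the control flow only. It deliberately does not depend on HtmlUnit: ResultPage, collectAll and fakePages are hypothetical names invented for this illustration, not HtmlUnit types or methods. In a real scraper, next() would presumably click the "next" anchor (HtmlAnchor.click()), then call webClient.waitForBackgroundJavaScript(timeoutMillis), and return the page that is current after the wait. Whether the Hannover Messe pager behaves this way is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class PagingSketch {

    /** Hypothetical abstraction over one loaded results page (not an HtmlUnit type). */
    interface ResultPage {
        /** Company names found on this page. */
        List<String> companyNames();

        /**
         * Advances to the following page. An HtmlUnit-backed version would
         * click the "next" anchor, wait for background JavaScript right
         * after that click, and wrap the resulting page. Returns empty
         * when there is no active "next" button.
         */
        Optional<ResultPage> next();
    }

    /** Collects names page by page until no next page exists or maxPages is reached. */
    static List<String> collectAll(ResultPage first, int maxPages) {
        List<String> all = new ArrayList<>();
        Optional<ResultPage> current = Optional.of(first);
        for (int n = 0; n < maxPages && current.isPresent(); n++) {
            ResultPage page = current.get();
            all.addAll(page.companyNames());
            // Always continue from the page returned by the action,
            // never from a reference to the previous page.
            current = page.next();
        }
        return all;
    }

    /** In-memory stand-in for real pages, used only to exercise the loop. */
    static ResultPage fakePages(final List<List<String>> pages, final int index) {
        return new ResultPage() {
            @Override
            public List<String> companyNames() {
                return pages.get(index);
            }

            @Override
            public Optional<ResultPage> next() {
                return index + 1 < pages.size()
                        ? Optional.of(fakePages(pages, index + 1))
                        : Optional.<ResultPage>empty();
            }
        };
    }

    public static void main(String[] args) {
        List<List<String>> pages = Arrays.asList(
                Arrays.asList("A GmbH", "B AG"),
                Arrays.asList("C Ltd"));
        // prints [A GmbH, B AG, C Ltd]
        System.out.println(collectAll(fakePages(pages, 0), 100));
    }
}
```

The maxPages cap is only a safety net against a pager that never reports its last page. The point of the sketch is the ordering: trigger the action, wait, then read from the page the action returned.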