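I am currently writing an HtmlUnit-based web scraper to harvest company names and details from the Hannover Messe website. I have hit an obstacle: I cannot get the page-forward button on the search results page to work.
The entry page is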
www.hannovermesse.de/en/exhibition/exhibitors-products/advanced-search/
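There I set some search filters via checkboxes (EU Region, Industrial Automation / Robotics).
After submitting the form and loading the search results I get about 400 hits,
and when I select the Exhibitors tab I receive the first results page.
The search results are displayed at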
//www.hannovermesse.de/en/exhibition/exhibitors-products/search
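Note: you need to run the whole sequence to get to the results screen! It seems
to use session/cookie data to determine what to display, and by default
it shows nothing.
This gives me 20 hits on the first page and, at the bottom of the page,
a page selector with page 1 selected: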
“[<] [1] 2 ... | n [>]”
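To harvest all contacts I need to click through all the screens
shown in the search results.
So my idea was to use the right-hand button to walk through the pages, harvesting the company details on each page as I go, and to terminate the loop when the right-hand button is no longer valid. I located the right-hand button with various methods such as getXPath, verified it, and even modified it by adding a name attribute so that I could find it with the usual HtmlAnchor lookup function.
The result is always a runtime error and an abort.
The log messages are: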
Mai 01, 2016 6:05:11 PM com.gargoylesoftware.htmlunit.html.HtmlScript isExecutionNeeded WARNING: Script is not JavaScript (type: text/html, language: ). Skipping execution.
Mai 01, 2016 6:05:12 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError SEVERE: runtimeError: message=[An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: :x).] sourceName=[http://www.hannovermesse.de/files/001-fs5/media/layout/js/dmag.min.js] line=[2] lineSource=[null] lineOffset=[0]
Mai 01, 2016 6:05:12 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError SEVERE: runtimeError: message=[An invalid or illegal selector was specified (selector: '[id='sizzle-1462118712173']:selected' error: Invalid selector: [id="sizzle-1462118712173"]:selected).] sourceName=[http://www.hannovermesse.de/files/001-fs5/media/layout/js/dmag.min.js] line=[2] lineSource=[null] lineOffset=[0]
Mai 01, 2016 6:05:12 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify WARNING: Obsolete content type encountered: 'text/javascript'.
Mai 01, 2016 6:05:17 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify WARNING: Obsolete content type encountered: 'application/x-javascript'.
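I tried various browser option settings, but no joy. I found that this "An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: :x)." error sometimes shows up with spiders and other test browsers. In one report, "webClient().waitForBackgroundJavaScriptStartingBefore(5000);" solved the problem. I tried that, but it did not work for me.
I attach my quick-and-dirty proof-of-concept Java program for reference. I am using Eclipse Mars with Java JRE 1.8, JUnit 4 and the HtmlUnit 2.22 library.
Does anyone know what is going on, or what to change to make it work? I can't believe I am the first to stumble over this!
My Java code: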
/*---------------------------------------------------------------------------------*/
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebClientOptions;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlCheckBoxInput;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlForm;
import com.gargoylesoftware.htmlunit.html.HtmlOption;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.html.HtmlSelect;
import com.gargoylesoftware.htmlunit.html.HtmlSubmitInput;

public class App {
    static WebClient webClient;
    static String[] countries = {
        "European Union"
    };
    static String[] categories = {
        "Robotics"
    };

    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws Exception {
        setUp();
        HtmlPage currentPage = webClient.getPage("http://www.hannovermesse.de/en/exhibition/exhibitors-products/advanced-search/");
        System.out.println(currentPage.getTitleText()+"Web page open\n------------------------------------------------------------------\n");
        registerCountries(currentPage);
        registerCategories(currentPage);
        System.out.println("Search filters registered\n------------------------------------------------------------------\n");
        currentPage = submitSearchRequest(currentPage);
        System.out.println("Search filters submitted and results loaded\n------------------------------------------------------------------\n");
        selectExhibitorView(currentPage);
        System.out.println("Exhibitor View selected\n------------------------------------------------------------------\n");
        showCriteria(currentPage);
        showResultsCount(currentPage);
        HtmlPage backupPage = currentPage;
        // Proof of concept: only one pass for now (n < 1); tn counts the hits printed so far.
        for (int n = 0, tn = 0; n < 1; n++) {
            System.out.println("========================================================================================");
            // Parentheses are needed, otherwise "+n+1" concatenates "0" and "1" instead of adding.
            System.out.println(" Results page " + (n + 1));
            // Locate the [>] anchor that follows the page-number list.
            HtmlAnchor nextPageButton = (HtmlAnchor) currentPage.getFirstByXPath(".//div[@class=\"col s-col12 m-col12 l-col12\"]/ul/following-sibling::a");
            String classValue = nextPageButton.getAttribute("class");
            // Give the anchor a name so that getAnchorByName can find it later.
            nextPageButton.setAttribute("name", nextPageButton.getAttribute("class").trim());
            // Dump the anchor's attributes for verification.
            NamedNodeMap attribList = nextPageButton.getAttributes();
            for (int i = 0; i < attribList.getLength(); i++) {
                Node node = attribList.item(i);
                String key = node.getLocalName();
                String val = node.getNodeValue();
                System.out.printf("[%-15s] : '%s'\n", key, val);
            }
            // Print the company names found on this page.
            List<HtmlElement> elementList = (List<HtmlElement>) currentPage.getByXPath(".//h4[@itemprop=\"name\"]/text()");
            int i = 0;
            for (; i < elementList.size(); i++) {
                System.out.printf("[%3d] '%s'\n", tn + i, elementList.get(i));
            }
            // Accumulate the running total rather than overwriting it.
            tn += i;
            System.out.println("Next Button :");
            // Click the next-page anchor, first via the name lookup, then via the XPath reference.
            final HtmlAnchor newPageLink = (HtmlAnchor) currentPage.getAnchorByName(classValue.trim());
            currentPage = (HtmlPage) newPageLink.click();
            currentPage = nextPageButton.click();
            System.out.println("===========>[13]");
        }
        currentPage = backupPage;
        System.out.println("Done");
        webClient.close();
    }

    private static void showResultsCount(HtmlPage currentPage) {
        String results = "";
        int count;
        results = (String) currentPage.getByXPath("String("+".//div[@class=\"col l-col8 m-col7 s-col12\"]/p[@class=\"query-text\"]/text()"+")").get(0);
        publish("Raw results : "+results);
        count = Integer.parseInt(results.split(" ")[0]);
        publish("Results : "+count+" found.\n");
    }

    private static void selectExhibitorView(HtmlPage currentPage) {
        HtmlSelect select = (HtmlSelect) currentPage.getElementById("searchResult:resultType");
        HtmlOption option = select.getOptionByValue("1");
        select.setSelectedAttribute(option, true);
    }

    private static HtmlPage submitSearchRequest(HtmlPage currentPage) {
        try {
            final HtmlForm form = (HtmlForm) currentPage.getFormByName("searchAP:search");
            final HtmlSubmitInput button = form.getInputByName("searchAP:searchButton2");
            currentPage = (HtmlPage) button.click();
            System.out.println(currentPage.getTitleText());
        } catch (Exception e) {
            System.out.println("===> Cannot submit Search Form, no submit button found!");
        }
        return currentPage;
    }

    private static void showCriteria(HtmlPage currentPage) {
        publish("Filtercriteria for this search:");
        String results = "";
        results = (String) currentPage.getByXPath("String(.//h1[contains(text(), \"Search Result\")]/following-sibling::p)").get(0);
        String[] criteria = results.split(",");
        String key = "";
        Map<String, ArrayList<String>> cMap = new LinkedHashMap<String, ArrayList<String>>();
        ArrayList<String> value = new ArrayList<String>();
        cMap.put(key, value);
        for (int i = 0; i < criteria.length; i++) {
            if (criteria[i].contains(":")) {
                String workCopy = new String(criteria[i]);
                String[] bits = workCopy.split(":");
                key = bits[0].trim();
                criteria[i] = bits[1].trim();
                value = new ArrayList<String>();
                cMap.put(key, value);
            }
            value.add(criteria[i].trim());
        }
        for (Map.Entry<String, ArrayList<String>> entry : cMap.entrySet()) {
            key = entry.getKey();
            value = entry.getValue();
            if (!value.isEmpty()) {
                System.out.println(key+": ");
                for (int i = 0; i < value.size(); i++) {
                    System.out.println(" "+value.get(i));
                }
            }
        }
    }

    public static void publish(String text) {
        System.out.println(text);
    }

    public static void registerCountries(HtmlPage currentPage) {
        for (int i = 0; i < countries.length; i++) {
            setCountryCheckbox(currentPage, countries[i]);
        }
    }

    public static void registerCategories(HtmlPage currentPage) {
        for (int i = 0; i < categories.length; i++) {
            setCategoryCheckbox(currentPage, categories[i]);
        }
    }

    public static void setCountryCheckbox(HtmlPage currentPage, String text) {
        String label = "";
        HtmlCheckBoxInput input;
        try {
            label = (String) currentPage.getByXPath("String(.//label[contains(text(), \""+text+"\")]/@for)").get(0);
            System.out.print(text);
            input = currentPage.getHtmlElementById(label);
            input.setChecked(true);
            System.out.println(": "+(input.isChecked()?"SET":""));
        } catch (Exception e) {
            System.out.println("\rError: Label ID for '"+text+"' not found. ");
        }
    }

    public static void setCategoryCheckbox(HtmlPage currentPage, String text) {
        String label = "";
        HtmlCheckBoxInput input;
        String XPathXpression = ".//strong[contains(text(), \""+text+"\")]/parent::div/input/@id";
        try {
            label = (String) currentPage.getByXPath("String("+XPathXpression+")").get(0);
            System.out.print(text+" : "+"'"+label+"' ");
            input = currentPage.getHtmlElementById(label);
            input.setChecked(true);
            System.out.println(": "+(input.isChecked()?"SET":""));
        } catch (Exception e) {
            System.out.println("\rError: Label ID for '"+text+"' not found. ");
        }
    }

    public static void setUp() throws InterruptedException {
        webClient = new WebClient(BrowserVersion.FIREFOX_45);
        WebClientOptions options = webClient.getOptions();
        options.setPrintContentOnFailingStatusCode(true);
        options.setJavaScriptEnabled(true);
        options.setThrowExceptionOnScriptError(false);
        options.setThrowExceptionOnFailingStatusCode(false);
        webClient.waitForBackgroundJavaScriptStartingBefore(5000);
    }
}
Answer 0: (score: 0)
If you use HtmlSubmitInput for the button, HtmlUnit looks for an input-type field and will not find a button element.
Use HtmlButton instead of HtmlSubmitInput.
Here is an example:
HtmlButton button = form.getButtonByName("submitButton");
Answer 1: (score: 0)
Just two hints:
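"An invalid or illegal selector was specified ..." is very common output when testing web applications that use jQuery with HtmlUnit. It means jQuery makes some calls to probe which CSS selectors the browser supports. Because HtmlUnit logs the exception at the moment it is constructed, you see this log output. The exception is handled later by the (jQuery) JavaScript code. Usually you can ignore it.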
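If the selector messages are only noise, one way to hide them is to lower the log level for the HtmlUnit package. The sketch below is a suggestion rather than part of the original answer. It assumes HtmlUnit's logging ends up in java.util.logging, which the format of the log lines in the question suggests; the class name QuietHtmlUnitLogging is made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietHtmlUnitLogging {
    // Keep a strong reference: java.util.logging holds loggers weakly, so an
    // unreferenced logger (and the level set on it) could be garbage-collected.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware.htmlunit");

    /**
     * Switches off java.util.logging output for the HtmlUnit package.
     * Child loggers without their own level inherit this setting.
     */
    public static Logger silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
        return HTMLUNIT_LOGGER;
    }

    public static void main(String[] args) {
        Logger logger = silence();
        System.out.println("HtmlUnit log level: " + logger.getLevel());
    }
}
```

Calling silence() before creating the WebClient would be the natural place. Level.OFF also hides genuine problems, so a less drastic level may be preferable while debugging.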
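webClient.waitForBackgroundJavaScriptStartingBefore(5000); is not an option. This call does not set any wait timeout. You have to place the call in your normal application flow, usually right after triggering some action. You may need it if you trigger an Ajax operation.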
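To illustrate where such a wait belongs in the flow, here is a minimal sketch of the "click next until there is no next page" loop from the question. It is not the asker's code and not an HtmlUnit API: PagingSketch, collectAll and the maxPages guard are hypothetical names, and plain integers stand in for real pages so the example runs without HtmlUnit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

public class PagingSketch {

    /**
     * Walks a chain of pages: extracts the items of the current page, then asks
     * nextPage for the following one, until nextPage returns Optional.empty()
     * or maxPages pages were visited (a guard against endless loops).
     */
    public static <P, R> List<R> collectAll(P firstPage,
                                            Function<P, List<R>> extractItems,
                                            Function<P, Optional<P>> nextPage,
                                            int maxPages) {
        List<R> all = new ArrayList<>();
        Optional<P> current = Optional.of(firstPage);
        for (int n = 0; n < maxPages && current.isPresent(); n++) {
            P page = current.get();
            all.addAll(extractItems.apply(page));
            // In a real scraper this is the step that triggers the action
            // (the click) and then waits for background JavaScript.
            current = nextPage.apply(page);
        }
        return all;
    }

    public static void main(String[] args) {
        // Stand-in data: page numbers 1..3 instead of real HtmlPage objects.
        List<String> names = collectAll(
                1,
                page -> Arrays.asList("company-" + page + "a", "company-" + page + "b"),
                page -> page < 3 ? Optional.of(page + 1) : Optional.empty(),
                100);
        System.out.println(names);
    }
}
```

With HtmlUnit, nextPage would presumably look up the [>] anchor on the given HtmlPage, return Optional.empty() if it is missing, and otherwise call click() followed by webClient.waitForBackgroundJavaScript(...) before returning the new page. That puts the wait directly after the triggering action, as described above.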