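I am trying to scrape data from an ASP page whose links call the __doPostBack function when clicked. When I click() one of the links with HtmlUnit, it returns the page I started on. What do I need to do so that the postback completes and the next page is returned?
Code: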
import java.util.List;
import com.gargoylesoftware.htmlunit.ScriptResult;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class ScrapperApp {
    private static void go() throws Exception {
        /* turn off annoying htmlunit warnings */
        java.util.logging.Logger.getLogger("com.gargoylesoftware").setLevel(java.util.logging.Level.OFF);
        HtmlPage nextPage;
        ScriptResult onClick;
        String url = "http://media.ethics.ga.gov/search/Campaign/Campaign_Name.aspx?NameID=5751&FilerID=C2009000085&Type=candidate";
        final WebClient webclient = new WebClient(BrowserVersion.CHROME_16);
        final HtmlPage page = webclient.getPage(url);
        System.out.println("PULLING LINKS:");
        List<HtmlAnchor> articles = (List<HtmlAnchor>) page.getByXPath("//table[@id='ctl00_ContentPlaceHolder1_Name_Reports1_TabContainer1_TabPanel1_dgReports']/tbody/tr/td/a[@class='lblentrylink']");
        for (int x = 0; x < articles.size(); x++) {
            System.out.println("Clicking " + x + ": " + articles.get(x).asText());
            nextPage = articles.get(x).click();
            System.out.println(nextPage.getUrl());
        }
    }

    public static void main(String[] args) throws Exception {
        go();
        System.out.println("COMPLETE");
    }
}
Output:
PULLING LINKS:
Clicking 0:
http://media.ethics.ga.gov/search/Campaign/Campaign_Name.aspx?NameID=5751&FilerID=C2009000085&Type=candidate
Clicking 1:
http://media.ethics.ga.gov/search/Campaign/Campaign_Name.aspx?NameID=5751&FilerID=C2009000085&Type=candidate
...
Answer 0 (score: 4)
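You can't simply loop over these links, just as you couldn't open each of them in a new browser window.
You need to go back to the "url" page before each click.
HtmlUnit also needs a few tweaks to work here; this is my working code.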
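import java.util.ArrayList;
import java.util.List;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class ScrapperApp {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24);
        // Allow slow responses: 2 minute connection/read timeout
        webClient.getOptions().setTimeout(120000);
        // Wait up to 60 s for pending background JavaScript
        webClient.waitForBackgroundJavaScript(60000);
        webClient.getOptions().setRedirectEnabled(true);
        // JavaScript must be enabled, since __doPostBack is a script function
        webClient.getOptions().setJavaScriptEnabled(true);
        // Keep going on HTTP error codes and script errors instead of throwing
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(false);
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.getOptions().setCssEnabled(false);
        webClient.getOptions().setUseInsecureSSL(true);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        String url = "http://media.ethics.ga.gov/search/Campaign/Campaign_Name.aspx?NameID=5751&FilerID=C2009000085&Type=candidate";
        HtmlPage rootPage = webClient.getPage(url);
        List<String> texts = new ArrayList<String>();
        for (HtmlAnchor a : (List<HtmlAnchor>) rootPage.getByXPath("//table[@id='ctl00_ContentPlaceHolder1_Name_Reports1_TabContainer1_TabPanel1_dgReports']/tbody/tr/td/a[@class='lblentrylink']")) {
            // Go back to the start page before every click
            rootPage = webClient.getPage(url);
            HtmlPage page = a.click();
            String text = page.asText();
            // Report each distinct result page once
            if (!texts.contains(text)) {
                System.out.println(page.getUrl());
                texts.add(text);
            } else {
                System.out.println("already seen");
            }
        }
        for (String s : texts) {
            System.out.println(s);
        }
    }
}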
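For background on why the start page matters: in ASP.NET WebForms, __doPostBack(eventTarget, eventArgument) stores its two arguments in the hidden fields __EVENTTARGET and __EVENTARGUMENT and then submits the page's form, together with the other hidden state such as __VIEWSTATE. The sketch below is plain Java with no HtmlUnit involved, and only illustrates which fields such a click ends up posting. The class name PostBackSketch, the control ID ctl00$grid$ctl02$link and the __VIEWSTATE value abc123 are made up for the example and are not taken from the page in the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PostBackSketch {
    // Matches href values such as javascript:__doPostBack('target','argument')
    private static final Pattern POSTBACK =
            Pattern.compile("__doPostBack\\('([^']*)'\\s*,\\s*'([^']*)'\\)");

    /** Returns {eventTarget, eventArgument}, or null if href is not a postback link. */
    static String[] parse(String href) {
        Matcher m = POSTBACK.matcher(href);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    /**
     * Combines the hidden fields copied from the current page (for example
     * __VIEWSTATE) with the two fields that __doPostBack fills in.
     */
    static Map<String, String> formFields(String href, Map<String, String> hiddenFields) {
        String[] parts = parse(href);
        if (parts == null) {
            throw new IllegalArgumentException("not a __doPostBack link: " + href);
        }
        Map<String, String> fields = new LinkedHashMap<String, String>(hiddenFields);
        fields.put("__EVENTTARGET", parts[0]);
        fields.put("__EVENTARGUMENT", parts[1]);
        return fields;
    }

    /** Encodes the fields as an application/x-www-form-urlencoded body. */
    static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (body.length() > 0) {
                body.append('&');
            }
            body.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return body.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        // Hypothetical values, for illustration only
        Map<String, String> hidden = new LinkedHashMap<String, String>();
        hidden.put("__VIEWSTATE", "abc123");
        String href = "javascript:__doPostBack('ctl00$grid$ctl02$link','')";
        System.out.println(encode(formFields(href, hidden)));
    }
}
```

Because the hidden state is read from whichever page is currently loaded, a postback only makes sense against a fresh copy of that page, which fits the advice above to return to the "url" page before each click.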