I am trying to write a program with HtmlUnit that fetches a website's source code and returns it. My code so far is:
public class Htmlunitscraper {
    private static String s = "website";

    public static HtmlPage scrapeWebsite() throws IOException {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage(s);
        return page.getPage();
    }
}
I thought the getPage method would return the source code, but I keep getting errors and only the URL is returned. The errors are:
Oct 16, 2013 4:07:59 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
Oct 16, 2013 4:08:00 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SEVERE: runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[http://cpdocket.cp.cuyahogacounty.us/SheriffSearch/Scripts/jquery.js] line=[2] lineSource=[null] lineOffset=[0]
Oct 16, 2013 4:08:00 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
Oct 16, 2013 4:08:00 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
Oct 16, 2013 4:08:00 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
Oct 16, 2013 4:08:01 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
Oct 16, 2013 4:08:01 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SEVERE: runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[http://cpdocket.cp.cuyahogacounty.us/SheriffSearch/ScriptResource.axd?d=0XCJGMnW_16F7h4EC7avEaQ_Ma7RLZvTA2-XkhkFcfSnWFOkCRjbat77Yi12o3uS3yGC-YMdXQ_w3i5MHWALH-xBqxutgCryrSWcT8prtHkRngrJRiKTP-EYEm1QJ6zB0&t=ffffffff823b7694] line=[2] lineSource=[null] lineOffset=[0]
Oct 16, 2013 4:08:01 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNING: Obsolete content type encountered: 'application/x-javascript'.
HtmlPage(http://cpdocket.cp.cuyahogacounty.us/SheriffSearch/results.aspx?q=searchType%3dSaleDate%26searchString%3d10%2f21%2f2013%26foreclosureType%3d%27NONT%27%2c+%27PAR%27%2c+%27COMM%27%2c+%27TXLN%27)@1134201154
I am clearly not using the right method to return the source code, and I have not been able to find a good example of how to do this.
Answer 0 (score: 1)
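You should look at the page content by doing the following:
System.out.println(page.asXml());
This prints the page as well-formatted markup. The last line of your output, HtmlPage(...)@1134201154, is just the toString() of the HtmlPage object, which is why you only see the URL.
Everything else you are seeing is warnings and JavaScript errors logged by HtmlUnit while it runs the scripts of the page you are scraping; they do not come from your own code.
If you need the page source without the formatting, page.getWebResponse().getContentAsString() returns the response body as the server sent it; see also the following answer:
Check this answer to turn off these warnings: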
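As a rough sketch of turning those warnings off: the log lines in the question look like java.util.logging output, so assuming HtmlUnit's logging ends up in java.util.logging (the usual case when no other logging backend such as Log4j is on the classpath), raising the level of the "com.gargoylesoftware.htmlunit" logger should silence them. The class name HtmlUnitLogSilencer and the method silence() are made up for this illustration and are not part of HtmlUnit.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of HtmlUnit.
public class HtmlUnitLogSilencer {
    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // level set on an otherwise unreferenced logger can be lost after GC.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware.htmlunit");

    /** Turns off java.util.logging output for the HtmlUnit package tree. */
    public static Logger silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
        return HTMLUNIT_LOGGER;
    }

    public static void main(String[] args) {
        Logger logger = silence();
        // Child loggers (for example ...htmlunit.IncorrectnessListenerImpl)
        // have no level of their own by default and inherit this one.
        System.out.println(logger.getName() + " level: " + logger.getLevel());
    }
}
```

Call HtmlUnitLogSilencer.silence() once before creating the WebClient. This only hides the messages; the page's scripts still run and can still fail in the same way.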
Answer 1 (score: 0)
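Try this code, which works. Note that the method has to return String, because asXml() returns the markup as text rather than an HtmlPage:
import java.io.IOException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Htmlunitscraper {
    private static String s = "website";

    // Loads the page and returns its markup as formatted XML text.
    public static String scrapeWebsite() throws IOException {
        final WebClient webClient = new WebClient();
        final HtmlPage page = webClient.getPage(s);
        return page.asXml();
    }
}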