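I'm trying to do some scraping on this site to look up polling information programmatically. I initially tried Python, which was great for loading the site and navigating around the aspx forms, but it couldn't pull the embedded map data (since no package, as of yet, handles JavaScript). So I decided to dust off my Java skills and break out HtmlUnit. However, I hit a roadblock almost immediately.
It seems the site has dead links to some JavaScript files that don't exist. When HtmlUnit tries to load them, it gets a 404 and self-destructs.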
Jul 21, 2013 9:51:22 PM com.gargoylesoftware.htmlunit.html.HtmlPage loadExternalJavaScriptFile
SEVERE: Error loading JavaScript from [http://www.eci-polldaymonitoring.nic.in/psl/GoogleMapForASPNet.ascx/jsdebug].
com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException: 404 Not Found for http://www.eci-polldaymonitoring.nic.in/psl/GoogleMapForASPNet.ascx/jsdebug
at com.gargoylesoftware.htmlunit.WebClient.throwFailingHttpStatusCodeExceptionIfNecessary(WebClient.java:544)
at com.gargoylesoftware.htmlunit.html.HtmlPage.loadJavaScriptFromUrl(HtmlPage.java:1119)
at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:1059)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:399)
at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:260)
at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:276)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:676)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:635)
at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170)
at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072)
at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206)
at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330)
at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3074)
at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2041)
at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:918)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:892)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:241)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:187)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:268)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:156)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:434)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:309)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:374)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:359)
at ScrapeTest$.main(ScrapeTest.scala:12)
at ScrapeTest.main(ScrapeTest.scala)
Is there a way to tell it to (a) ignore 404 errors entirely, or (b) ignore specific JavaScript URLs?
import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.html.HtmlPage

object ScrapeTest {
  def main(args: Array[String]): Unit = {
    val pageurl = "http://www.eci-polldaymonitoring.nic.in/psl/"
    val client = new WebClient(BrowserVersion.INTERNET_EXPLORER_8)
    val response: HtmlPage = client.getPage(pageurl)
    println(response.asText())
  }
}
Answer 0 (score: 9)
A quick look at the HtmlUnit JavaDoc suggests that you should be able to use WebClientOptions#setThrowExceptionOnFailingStatusCode(boolean).
For example:
import com.gargoylesoftware.htmlunit.WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion
import com.gargoylesoftware.htmlunit.html.HtmlPage

object ScrapeTest {
  def main(args: Array[String]): Unit = {
    val pageurl = "http://www.eci-polldaymonitoring.nic.in/psl/"
    val client = new WebClient(BrowserVersion.INTERNET_EXPLORER_8)
    // Don't throw an exception when a request returns a failing status code (e.g. 404)
    client.getOptions.setThrowExceptionOnFailingStatusCode(false)
    val response: HtmlPage = client.getPage(pageurl)
    println(response.asText())
  }
}
If that doesn't work, you could also try:
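Answer 1 (score: 0)
I had the same problem. I didn't want HtmlUnit to request external links. I also didn't want it to print CSS/JS warnings and all the other noise.
This is how I configured HtmlUnit (using a Spring WebApplicationContext):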
// Note: FALSE, PRIVATE, INTEGER_ONE and startsWith are presumably static imports
// (e.g. Boolean.FALSE, Lombok's AccessLevel.PRIVATE, Apache Commons Lang helpers);
// the import list is not shown here.
@NoArgsConstructor(access = PRIVATE)
public final class _MockWebClientCreator {

    public static WebClient createWebClient(WebApplicationContext wac) {
        WebClient webClient = MockMvcWebClientBuilder.webAppContextSetup(wac).build();

        // Don't abort on script errors or failing HTTP status codes, and keep the log quiet
        webClient.getOptions().setThrowExceptionOnScriptError(FALSE);
        webClient.getOptions().setThrowExceptionOnFailingStatusCode(FALSE);
        webClient.getOptions().setPrintContentOnFailingStatusCode(FALSE);
        webClient.setCssErrorHandler(new SilentCssErrorHandler());
        webClient.setJavaScriptErrorListener(new SilentJavaScriptErrorListener());

        webClient.setWebConnection(new WebConnectionWrapper(webClient) { // Use only internal urls
            @Override
            public WebResponse getResponse(WebRequest request) throws IOException {
                // Anything that is not on localhost gets an empty response instead of a real request
                return (startsWith(request.getUrl().toString(), "http://localhost"))
                        ? super.getResponse(request)
                        : new StringWebResponse("", request.getUrl());
            }
        });

        webClient.setJavaScriptTimeout(Duration.ofSeconds(INTEGER_ONE).toMillis());
        return webClient;
    }
}
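The same WebConnectionWrapper idea could probably be adapted to option (b) from the question, i.e. skipping only specific JavaScript URLs rather than everything outside localhost. Below is a minimal sketch of just the decision logic, in plain Java with no HtmlUnit dependency. The class name ScriptBlockList and the method shouldStub are hypothetical names made up for illustration; the "/jsdebug" suffix is taken from the 404 in the question's stack trace.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: decides which request URLs should get an empty stub response. */
final class ScriptBlockList {

    // Path suffixes to skip. "/jsdebug" comes from the failing URL in the question;
    // add further suffixes here as needed.
    private static final List<String> BLOCKED_SUFFIXES = Arrays.asList("/jsdebug");

    private ScriptBlockList() {
    }

    /** Returns true if the URL's path ends with one of the blocked suffixes. */
    static boolean shouldStub(String url) {
        String path;
        try {
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            // Unparsable URL: don't stub it, let the real connection deal with it
            return false;
        }
        if (path == null) {
            return false;
        }
        for (String suffix : BLOCKED_SUFFIXES) {
            if (path.endsWith(suffix)) {
                return true;
            }
        }
        return false;
    }
}
```

Inside the getResponse override above, the localhost check would then be replaced by something like: if ScriptBlockList.shouldStub(request.getUrl().toString()) is true, return new StringWebResponse("", request.getUrl()), otherwise return super.getResponse(request). I have not tried this against the site in the question, so treat it as a starting point.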