I am using HtmlUnit 2.9 (the stable version released this month). Do you know why the following code doesn't work?
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.SilentCssErrorHandler;
import com.gargoylesoftware.htmlunit.WebClient;

public class Main {
    public static void main(String[] args) {
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3_6);
        webClient.setCssEnabled(true);
        webClient.setCssErrorHandler(new SilentCssErrorHandler());
        webClient.setThrowExceptionOnFailingStatusCode(false);
        webClient.setThrowExceptionOnScriptError(false);
        webClient.setRedirectEnabled(false);
        webClient.setAppletEnabled(false);
        webClient.setJavaScriptEnabled(false);
        webClient.setPopupBlockerEnabled(true);
        webClient.setTimeout(60000);
        webClient.setPrintContentOnFailingStatusCode(false);
        System.out.println("This is printed on screen");
        try {
            // This call never returns for the URL below.
            webClient.getPage("http://www.2cash.info/index.php");
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println("This is NEVER printed on screen");
    }
}
I am also adding the jstack output. Note that I have marked a section that repeats over and over:
2011-08-26 03:15:45
Full thread dump Java HotSpot(TM) Server VM (20.1-b02 mixed mode):
"Attach Listener" daemon prio=10 tid=0x09520400 nid=0x5363 waiting on condition [0x00000000]
java.lang.Thread.State: RUNNABLE
"JS executor for com.gargoylesoftware.htmlunit.WebClient@a7c45e" daemon prio=10 tid=0x6feb7400 nid=0x5356 waiting on condition [0x6fcfe000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutor.run(JavaScriptExecutor.java:166)
at java.lang.Thread.run(Thread.java:662)
"Low Memory Detector" daemon prio=10 tid=0x70204c00 nid=0x5352 runnable [0x00000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread1" daemon prio=10 tid=0x70202800 nid=0x5351 runnable [0x00000000]
java.lang.Thread.State: RUNNABLE
"C2 CompilerThread0" daemon prio=10 tid=0x70200800 nid=0x5350 waiting on condition [0x00000000]
java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x09514c00 nid=0x534f runnable [0x00000000]
java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x09503400 nid=0x534e in Object.wait() [0x70798000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x76af2ff0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
- locked <0x76af2ff0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
"Reference Handler" daemon prio=10 tid=0x09501c00 nid=0x534d in Object.wait() [0x707e9000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x7675cc58> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:485)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
- locked <0x7675cc58> (a java.lang.ref.Reference$Lock)
"main" prio=10 tid=0x09482400 nid=0x5349 runnable [0xb6c34000]
java.lang.Thread.State: RUNNABLE
at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.getSlot(ScriptableObject.java:2603)
at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.defineProperty(ScriptableObject.java:1699)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.configureConstantsPropertiesAndFunctions(JavaScriptEngine.java:350)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.configureClass(JavaScriptEngine.java:330)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.init(JavaScriptEngine.java:199)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.access$000(JavaScriptEngine.java:79)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$1.run(JavaScriptEngine.java:146)
at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:537)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:538)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.initialize(JavaScriptEngine.java:157)
at com.gargoylesoftware.htmlunit.WebClient.initialize(WebClient.java:1141)
at com.gargoylesoftware.htmlunit.WebWindowImpl.setEnclosedPage(WebWindowImpl.java:109)
at com.gargoylesoftware.htmlunit.html.FrameWindow.setEnclosedPage(FrameWindow.java:102)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:200)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
at com.gargoylesoftware.htmlunit.html.BaseFrame.<init>(BaseFrame.java:73)
at com.gargoylesoftware.htmlunit.html.HtmlInlineFrame.<init>(HtmlInlineFrame.java:46)
at com.gargoylesoftware.htmlunit.html.DefaultElementFactory.createElementNS(DefaultElementFactory.java:288)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.startElement(HTMLParser.java:506)
at org.apache.xerces.parsers.AbstractSAXParser.startElement(Unknown Source)
at org.cyberneko.html.HTMLTagBalancer.callStartElement(HTMLTagBalancer.java:1136)
at org.cyberneko.html.HTMLTagBalancer.startElement(HTMLTagBalancer.java:742)
at org.cyberneko.html.filters.DefaultFilter.startElement(DefaultFilter.java:136)
at org.cyberneko.html.filters.NamespaceBinder.startElement(NamespaceBinder.java:278)
at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2652)
at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2022)
at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:908)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:789)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:225)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:179)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:221)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:106)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:433)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
<THIS_SECTION_IS_PRINTED_AS_IF_IT_WERE_IN_A_LOOP>
at com.gargoylesoftware.htmlunit.html.BaseFrame.loadInnerPageIfPossible(BaseFrame.java:149)
at com.gargoylesoftware.htmlunit.html.BaseFrame.loadInnerPage(BaseFrame.java:99)
at com.gargoylesoftware.htmlunit.html.HtmlPage.loadFrames(HtmlPage.java:1760)
at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:194)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:440)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
</THIS_SECTION_IS_PRINTED_AS_IF_IT_WERE_IN_A_LOOP>
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:373)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:358)
at main.Main.<init>(Main.java:42)
at main.Main.main(Main.java:23)
"VM Thread" prio=10 tid=0x094fe000 nid=0x534c runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x09489800 nid=0x534a runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x0948ac00 nid=0x534b runnable
"VM Periodic Task Thread" prio=10 tid=0x70207000 nid=0x5353 waiting on condition
JNI global references: 1234
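I think there is some kind of loop in the automatic loading of frames. If that is the case, is there a way to disable that behavior in order to break the loop?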
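To make that hypothesis concrete, here is a minimal sketch of how a recursive frame loader can run forever and how a guard stops it. It is a toy model, not HtmlUnit code: the class FrameLoopSketch, the method loadWithFrames and the page names are all made up for illustration, and the assumption that the site's frames end up embedding each other is only a guess based on the repeating stack section.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model, not HtmlUnit code: each "page" maps to the
// src values of the frames/iframes it embeds.
final class FrameLoopSketch {

    // Loads a page and, recursively, every frame it embeds.
    // The visited set is what breaks a cycle such as A -> B -> A;
    // without it, this recursion would never terminate on such a site.
    static List<String> loadWithFrames(String url,
                                       Map<String, List<String>> framesByPage,
                                       Set<String> visited,
                                       List<String> loadOrder) {
        if (!visited.add(url)) {
            return loadOrder; // already loaded: stop instead of recursing again
        }
        loadOrder.add(url);
        for (String frameSrc : framesByPage.getOrDefault(url, List.of())) {
            loadWithFrames(frameSrc, framesByPage, visited, loadOrder);
        }
        return loadOrder;
    }

    public static void main(String[] args) {
        // Assumed layout: index.php embeds frame.php, which embeds index.php again.
        Map<String, List<String>> site = Map.of(
                "index.php", List.of("frame.php"),
                "frame.php", List.of("index.php"));
        List<String> order = loadWithFrames("index.php", site,
                new LinkedHashSet<>(), new ArrayList<>());
        System.out.println(order); // prints [index.php, frame.php]
    }
}
```

In the jstack output, each repetition of the marked section would correspond to one more level of this recursion, which is why I suspect the loader has no such guard for this page.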
Thanks in advance!
Answer 0 (score: 2)
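When I open this site in a browser, the page never finishes loading. That may be why HtmlUnit crashes. Tested with Chrome and FF.
Try loading a simpler site to find out whether this crash depends on the site.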
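One way to run that comparison without the program hanging forever is to put each load behind a time limit. The sketch below uses only java.util.concurrent; the class HangProbe and the method finishesWithin are names I made up, and the two Callable objects merely stand in for webClient.getPage(...) on a simple site and on the problem site.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of HtmlUnit.
final class HangProbe {

    // Runs the task on a daemon thread and reports whether it finished in time.
    static boolean finishesWithin(Callable<?> task, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "hang-probe");
            t.setDaemon(true); // a stuck task must not keep the JVM alive
            return t;
        });
        Future<?> future = executor.submit(task);
        try {
            future.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            future.cancel(true); // only interrupts; a busy loop may keep running
            return false;
        } catch (ExecutionException e) {
            return true; // the task failed, but it did not hang
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the flag for the caller
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Stand-ins for loading a simple page and the problem page.
        Callable<String> quick = () -> "loaded";
        Callable<String> stuck = () -> { Thread.sleep(60_000); return "never"; };
        System.out.println("simple page finishes: " + finishesWithin(quick, 2000));
        System.out.println("problem page finishes: " + finishesWithin(stuck, 200));
    }
}
```

If the simple site finishes and the problem site times out, the hang is site-dependent. Note that cancelling only interrupts the worker thread, so a load stuck in a CPU-bound loop may keep running in the background until the JVM exits.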
Answer 1 (score: 2)
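Well, although this is a horrible solution (a workaround, really...), I finally decided to disable the automatic loading of frames in HtmlUnit, as one of HtmlUnit's developers suggested. This is what I did, in detail:
1. Got the HtmlUnit 2.9 sources and opened the file that defines loadFrames() (HtmlPage.java, per the stack trace in the question), located in htmlunit-2.9/src/main/java/com/gargoylesoftware/htmlunit/html
2. Commented out the content of the method (the body of the method, not its declaration)
3. Ran mvn -Dmaven.test.skip=true clean compile package
4. Took the resulting htmlunit-2.9.jar, located in htmlunit-2.9/artifacts, and replaced my current htmlunit-2.9.jar library file with it

You already know what my original code looked like (see the question). That code downloads every frame and iframe in the page. Here is an example of how to get a page that has frames and load only the frame you want: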
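try {
    // With the patched jar, this loads the outer page only; its frames stay empty.
    HtmlPage page = webClient.getPage("http://www.w3schools.com/HTML/tryit.asp?filename=tryhtml_noframes");
    // Pick the single iframe we care about.
    HtmlInlineFrame frame = page.getFirstByXPath("//iframe[@name='view']");
    // Resolve its src against the page URL and load it explicitly.
    page = webClient.getPage(page.getFullyQualifiedUrl(frame.getSrcAttribute()));
    System.out.println(page.asXml());
} catch (Exception e) {
    e.printStackTrace();
}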
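After changing the library this way, the content of the frames is empty once getPage() completes. Note that it is not null; it looks like it simply returns an empty frame. What we need to do is manually download the content of the frames we are interested in, which is why I call getPage() a second time.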
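The step that makes the second getPage() possible is turning the frame's src attribute into an absolute URL. As a rough illustration of that resolution step only, here is a sketch with plain java.net.URI. The class FrameUrlSketch, the method resolveFrameSrc and the sample src values are made up, and HtmlUnit's own getFullyQualifiedUrl may handle more cases than this.

```java
import java.net.URI;

// Hypothetical helper, not HtmlUnit code.
final class FrameUrlSketch {

    // Resolves a frame's src attribute against the URL of the page that embeds it.
    static String resolveFrameSrc(String pageUrl, String frameSrc) {
        return URI.create(pageUrl).resolve(frameSrc.trim()).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/dir/page.html";
        // Relative src: resolved against the page's directory.
        System.out.println(resolveFrameSrc(page, "frame.html"));
        // Root-relative src: resolved against the host.
        System.out.println(resolveFrameSrc(page, "/other/frame.html"));
        // Absolute src: returned unchanged.
        System.out.println(resolveFrameSrc(page, "http://example.org/abs.html"));
    }
}
```

Inside HtmlUnit code I would still use page.getFullyQualifiedUrl(...) as in the snippet above, since it already has the page at hand.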
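That is how I managed to selectively download frames and iframes with HtmlUnit. Any ideas on how to improve this would be much appreciated. In any case, I hope that some way to disable frame loading will be added to HtmlUnit in the future, perhaps a method such as getPage(URL url, boolean downloadFrames).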
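To show what I mean by such a method, here is a toy sketch of the behavior I would expect from it. Everything in it is hypothetical: SelectiveLoaderSketch is not an HtmlUnit class, the in-memory map stands in for the network, and the signature just mirrors the one I wished for above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical API shape only; this is not something HtmlUnit 2.9 offers.
final class SelectiveLoaderSketch {

    private final Map<String, List<String>> framesByPage; // stand-in for the network
    final List<String> fetched = new ArrayList<>();       // records every fetch, in order

    SelectiveLoaderSketch(Map<String, List<String>> framesByPage) {
        this.framesByPage = framesByPage;
    }

    // Fetches url; follows its frames only when downloadFrames is true.
    void getPage(String url, boolean downloadFrames) {
        fetched.add(url);
        if (!downloadFrames) {
            return; // frames stay empty until the caller asks for one
        }
        for (String src : framesByPage.getOrDefault(url, List.of())) {
            getPage(src, true);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> site =
                Map.of("main.html", List.of("menu.html", "view.html"));
        SelectiveLoaderSketch loader = new SelectiveLoaderSketch(site);
        loader.getPage("main.html", false); // outer page only
        loader.getPage("view.html", false); // the one frame we care about
        System.out.println(loader.fetched); // prints [main.html, view.html]
    }
}
```

With downloadFrames set to false, the caller decides which frames are worth a request, which is what the patched jar forces me to do by hand today.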
Hope this helps someone out there!