Getting page source with HtmlUnit: URL gets stuck

Time: 2013-01-18 09:39:39

Tags: web-scraping monitoring htmlunit

I am trying to fetch the page source of the following URL with HtmlUnit's getPage method.

http://denydesigns.com/collections/barbara-sherman-fleece-throw-blanket/products/barbara-sherman-antique-fleece-throw-blanket

It gets stuck somewhere. I tried to find the cause but could not. I also checked whether the thread created by HtmlUnit was BLOCKED or WAITING, but it was not.

Below is the log generated by HtmlUnit.

18 Jan 2013 04:14:47,832 -  main - ERROR - com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter.runtimeError(StrictErrorReporter.java:79) - runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[http://ajax.googleapis.com/ajax/libs/jquery/1.4.2/jquery.min.js] line=[16] lineSource=[null] lineOffset=[0]
18 Jan 2013 04:14:47,924 -  main -  WARN - com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument.jsxFunction_getElementById(HTMLDocument.java:1049) - getElementById(script1358500487923) did a getElementByName for Internet Explorer
18 Jan 2013 04:14:49,498 -  main - ERROR - com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter.runtimeError(StrictErrorReporter.java:79) - runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[http://code.jquery.com/jquery-latest.js] line=[911] lineSource=[null] lineOffset=[0]
18 Jan 2013 04:14:49,565 -  main -  WARN - com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument.jsxFunction_getElementById(HTMLDocument.java:1049) - getElementById(sizzle-1358500489525) did a getElementByName for Internet Explorer
18 Jan 2013 04:14:53,047 -  main -  WARN - com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject.jsConstructor(ActiveXObject.java:128) - Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.7'.
18 Jan 2013 04:14:53,048 -  main - ERROR - com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter.runtimeError(StrictErrorReporter.java:79) - runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.7'.] sourceName=[http://www.google-analytics.com/ga.js] line=[18] lineSource=[null] lineOffset=[0]
18 Jan 2013 04:14:53,060 -  main -  WARN - com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject.jsConstructor(ActiveXObject.java:128) - Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.6'.
18 Jan 2013 04:14:53,061 -  main - ERROR - com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter.runtimeError(StrictErrorReporter.java:79) - runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash.6'.] sourceName=[http://www.google-analytics.com/ga.js] line=[18] lineSource=[null] lineOffset=[0]
18 Jan 2013 04:14:53,061 -  main -  WARN - com.gargoylesoftware.htmlunit.javascript.host.ActiveXObject.jsConstructor(ActiveXObject.java:128) - Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash'.
18 Jan 2013 04:14:53,062 -  main - ERROR - com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter.runtimeError(StrictErrorReporter.java:79) - runtimeError: message=[Automation server can't create object for 'ShockwaveFlash.ShockwaveFlash'.] sourceName=[http://www.google-analytics.com/ga.js] line=[18] lineSource=[null] lineOffset=[0]
18 Jan 2013 04:14:53,829 -  main - ERROR - com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter.runtimeError(StrictErrorReporter.java:79) - runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[http://chat.livechatinc.net/licence/1051689/script.cgi?lang=en&groups=0] line=[60] lineSource=[null] lineOffset=[0]
18 Jan 2013 04:14:54,878 -  main - ERROR - com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter.runtimeError(StrictErrorReporter.java:79) - runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[http://platform.twitter.com/widgets.js] line=[5] lineSource=[null] lineOffset=[0]
18 Jan 2013 04:14:56,215 -  main -  WARN - com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument.jsxFunction_getElementById(HTMLDocument.java:1049) - getElementById(sizzle-1358500496196) did a getElementByName for Internet Explorer
18 Jan 2013 04:14:56,458 -  main -  WARN - com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument.jsxFunction_execCommand(HTMLDocument.java:1590) - Nothing done for execCommand(BackgroundImageCache, ...) (feature not implemented)
18 Jan 2013 04:14:58,086 -  main -  WARN - com.gargoylesoftware.htmlunit.javascript.host.html.HTMLDocument.jsxFunction_getElementById(HTMLDocument.java:1049) - getElementById(sizzle-1358500489525) did a getElementByName for Internet Explorer

Below is the thread dump of my process (taken with jstack).

2013-01-18 04:17:46
Full thread dump Java HotSpot(TM) 64-Bit Server VM (22.1-b02 mixed mode):

"Attach Listener" daemon prio=10 tid=0x0000000002955000 nid=0x16dd waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Service Thread" daemon prio=10 tid=0x00007feca00cc800 nid=0x154f runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread1" daemon prio=10 tid=0x00007feca00ca000 nid=0x154e waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" daemon prio=10 tid=0x00007feca00c7000 nid=0x154d waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00007feca00c5000 nid=0x154c runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00007feca007c800 nid=0x154b in Object.wait() [0x00007fec9fffe000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000c2369e20> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
        - locked <0x00000000c2369e20> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)

"Reference Handler" daemon prio=10 tid=0x00007feca007a000 nid=0x154a in Object.wait() [0x00007feca4157000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00000000c23699e0> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:503)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
        - locked <0x00000000c23699e0> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x00000000025d9000 nid=0x1546 runnable [0x00007fecaa8b6000]
   java.lang.Thread.State: RUNNABLE
        at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.getTopLevelScope(ScriptableObject.java:2007)
        at com.gargoylesoftware.htmlunit.javascript.SimpleScriptable.getWindow(SimpleScriptable.java:303)
        at com.gargoylesoftware.htmlunit.javascript.SimpleScriptable.getWindow(SimpleScriptable.java:293)
        at com.gargoylesoftware.htmlunit.javascript.SimpleScriptable.getPrototype(SimpleScriptable.java:251)
        at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLCollection.<init>(HTMLCollection.java:99)
        at com.gargoylesoftware.htmlunit.javascript.host.html.HTMLCollection.<init>(HTMLCollection.java:110)
        at com.gargoylesoftware.htmlunit.javascript.host.HTMLCollectionFrames.<init>(Window.java:1751)
        at com.gargoylesoftware.htmlunit.javascript.host.Window.getFrames(Window.java:759)
        at com.gargoylesoftware.htmlunit.javascript.host.Window.jsxGet_length(Window.java:749)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at net.sourceforge.htmlunit.corejs.javascript.MemberBox.invoke(MemberBox.java:172)
        at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject$GetterSlot.getValue(ScriptableObject.java:342)
        at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.getImpl(ScriptableObject.java:2523)
        at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.get(ScriptableObject.java:438)
        at com.gargoylesoftware.htmlunit.javascript.SimpleScriptable.get(SimpleScriptable.java:75)
        at com.gargoylesoftware.htmlunit.javascript.host.Window.get(Window.java:1226)
        at net.sourceforge.htmlunit.corejs.javascript.ScriptableObject.getProperty(ScriptableObject.java:2088)
        at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getObjectProp(ScriptRuntime.java:1527)
        at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getObjectProp(ScriptRuntime.java:1513)
        at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1398)
        at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:854)
        at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:164)
        at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:429)
        at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:267)
        at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3183)
        at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:162)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:538)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:589)
        - locked <0x00000000c274d308> (a com.gargoylesoftware.htmlunit.html.HtmlPage)
        at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:537)
        at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:538)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:545)
        at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:520)
        at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptFunctionIfPossible(HtmlPage.java:896)
        at com.gargoylesoftware.htmlunit.javascript.host.EventListenersContainer.executeEventListeners(EventListenersContainer.java:162)
        at com.gargoylesoftware.htmlunit.javascript.host.EventListenersContainer.executeBubblingListeners(EventListenersContainer.java:221)
        at com.gargoylesoftware.htmlunit.javascript.host.Node.fireEvent(Node.java:735)
        at com.gargoylesoftware.htmlunit.html.HtmlElement$2.run(HtmlElement.java:866)
        at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:537)
        at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:538)
        at com.gargoylesoftware.htmlunit.html.HtmlElement.fireEvent(HtmlElement.java:871)
        at com.gargoylesoftware.htmlunit.html.HtmlPage.executeEventHandlersIfNeeded(HtmlPage.java:1162)
        at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:202)
        at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:440)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
        at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:389)

"VM Thread" prio=10 tid=0x00007feca0072800 nid=0x1549 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x00000000025e4000 nid=0x1547 runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x00000000025e5800 nid=0x1548 runnable

"VM Periodic Task Thread" prio=10 tid=0x00007feca00d7800 nid=0x1550 waiting on condition

JNI global references: 317
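
If attaching jstack is inconvenient, roughly the same information can be collected from inside the JVM. The following is a minimal sketch using only the standard library; the class name `ThreadStateDump` is hypothetical and not part of HtmlUnit. It lists every live thread with its state, which makes it easy to confirm that "main" is RUNNABLE rather than BLOCKED or WAITING.

```java
import java.util.Map;

// Hypothetical helper: prints name, state and stack of every live thread.
final class ThreadStateDump {

    private ThreadStateDump() {}

    static String dump() {
        StringBuilder sb = new StringBuilder();
        // Thread.getAllStackTraces() returns a snapshot of all live threads.
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            sb.append('"').append(thread.getName()).append("\" state=")
              .append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("        at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

Calling `ThreadStateDump.dump()` from a second thread while `getPage` is stuck would show where "main" is spending its time.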

I am not sure why it gets stuck on this URL. The call never returns from the method. Could anyone look into this?

Update: this is com.gargoylesoftware.htmlunit.html.HTMLParser.HtmlUnitDOMBuilder.parse(XMLInputSource):

    @Override
    public void parse(final XMLInputSource inputSource) throws XNIException, IOException {
        final HtmlUnitDOMBuilder oldBuilder = page_.getBuilder();
        page_.setBuilder(this);
        try {
            super.parse(inputSource);
        }
        finally {
            page_.setBuilder(oldBuilder);
        }
    }

I attached the HtmlUnit source code and stepped through it in the debugger. The method above never finishes executing.

Also, I set the timeout as follows:

webClient.setTimeout(120000);

So why does it not give up after 2 minutes and throw some kind of timeout exception?

1 Answer:

Answer 0 (score: 1)

I followed up with the HtmlUnit user group. They have fixed this issue in HtmlUnit 2.12. Please check.
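
On the timeout question: `webClient.setTimeout` configures the timeout of the underlying web connection, and the thread dump shows "main" inside the JavaScript interpreter (an event handler run from `HtmlPage.initialize`), not in network I/O. That would be consistent with the connection timeout never firing, although this reading is an inference and was not confirmed in this thread. HtmlUnit also offers `WebClient.setJavaScriptTimeout(long)`; whether it interrupts this particular loop has not been verified here. For versions without the fix, a generic guard is to run the blocking call on a worker thread and stop waiting after a deadline. The sketch below uses only the standard library; `TimeoutGuard` and `callWithTimeout` are hypothetical names, and the sleeping task stands in for a call such as `webClient.getPage(url)`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: bounds how long the caller waits for a blocking task.
final class TimeoutGuard {

    private TimeoutGuard() {}

    static <T> T callWithTimeout(Callable<T> task, long timeout, TimeUnit unit)
            throws TimeoutException, ExecutionException, InterruptedException {
        // Daemon worker so a task that ignores interruption cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(runnable -> {
            Thread worker = new Thread(runnable, "timeout-guard-worker");
            worker.setDaemon(true);
            return worker;
        });
        Future<T> future = executor.submit(task);
        try {
            return future.get(timeout, unit);
        } catch (TimeoutException e) {
            // Requests interruption; a busy loop may keep running in the background.
            future.cancel(true);
            throw e;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a call that never returns, e.g. webClient.getPage(url).
        Callable<String> stuck = () -> {
            Thread.sleep(5_000);
            return "page source";
        };
        try {
            callWithTimeout(stuck, 200, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("gave up waiting");
        }
    }
}
```

Note the limitation: the caller regains control after the deadline, but a worker spinning in a CPU-bound loop is only asked to stop and may continue consuming CPU until the process exits. Upgrading to the fixed HtmlUnit version remains the actual remedy.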