Getting an HTML file with its JavaScript executed, using Java

Time: 2015-10-22 18:10:37

Tags: javascript java html htmlunit

I recently found out how to fetch HTML code using Java.

So I wrote the following method:

import java.io.IOException;
import java.net.URL;
import java.util.NoSuchElementException;
import java.util.Scanner;

public String htmlToString(String urlString) {
    // Returns the HTML source of the given website as a string.
    // If something doesn't work, "fail" is returned.
    try {
        // convert the String to a URL
        URL url = new URL(urlString);
        // read the URL with a Scanner; try-with-resources closes it even on errors
        try (Scanner s = new Scanner(url.openStream())) {
            // append line after line from the HTML file to the result;
            // nextLine() keeps the spaces inside a line, whereas next()
            // would silently drop all whitespace between tokens
            StringBuilder read = new StringBuilder();
            while (s.hasNextLine()) {
                read.append(s.nextLine()).append('\n');
            }
            return read.toString();
        }
    } catch (IOException iOEx) {
        // there was some connection problem, or the file did not exist on the server,
        // or your URL was not in the right format.
        // think about what to do now, and put it here.
        iOEx.printStackTrace(); // for now, simply output it.
        return "fail";
    } catch (NoSuchElementException elEX) {
        // couldn't find a next line
        // similar problems as described before
        elEX.printStackTrace();
        return "fail";
    }
}

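My problem is that the pages I am looking at contain a lot of JavaScript. I could work with the HTML if the scripts had already been executed, that is, if I got the code you see when a browser has opened the page and you view its source.

Is there a way to get this code?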

==============================================================================

Edit: I have now tried HtmlUnit, which I had never used before, and came up with this code:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.IOException;


public class Converter2 {

    public String htmlToString(String url){
        try{
            WebClient webClient = new WebClient();
            HtmlPage page = webClient.getPage(url);
            String pageAsText = page.asText();
            webClient.close();
            return pageAsText;
        }catch(IOException ioEx){
            return "fail";
        }
    }
}

When I run it I get a lot of errors. Trying it on Amazon, I get these:

    WARNUNG: CSS error: 'http://z-ecx.images-amazon.com/images/G/01/AUIClients/AmazonUI-2215197d18a3d0e321eb1a67a8b9e87ba4b4ab20._V2_.css#AUIClients/AmazonUI.rendering_engine-trident.min' [1:125781] Error in declaration. '*' ist als erstes Zeichen einer Property nicht erlaubt.
Okt 22, 2015 10:23:38 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNUNG: CSS error: 'http://z-ecx.images-amazon.com/images/G/01/AUIClients/AmazonUI-2215197d18a3d0e321eb1a67a8b9e87ba4b4ab20._V2_.css#AUIClients/AmazonUI.rendering_engine-trident.min' [1:125797] Error in declaration. '*' ist als erstes Zeichen einer Property nicht erlaubt.
Okt 22, 2015 10:23:38 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNUNG: CSS error: 'http://z-ecx.images-amazon.com/images/G/01/AUIClients/AmazonGatewayAuiAssets-3d5b6f366e05fa5c0b2f38dca7366948b0599a7b._V2_.css#AUIClients/AmazonGatewayAuiAssets.weblab-GW_NOT_INTERESTED_48787-C.min' [1:8806] Fehler in Ausdruck; ':' nach dem identifier "progid" gefunden.
Okt 22, 2015 10:23:38 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNUNG: CSS error: 'http://z-ecx.images-amazon.com/images/G/01/AUIClients/AmazonGatewayAuiAssets-3d5b6f366e05fa5c0b2f38dca7366948b0599a7b._V2_.css#AUIClients/AmazonGatewayAuiAssets.weblab-GW_NOT_INTERESTED_48787-C.min' [1:8942] Fehler in Ausdruck; ':' nach dem identifier "progid" gefunden.
Okt 22, 2015 10:23:38 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNUNG: Obsolete content type encountered: 'application/x-javascript'.
Okt 22, 2015 10:23:38 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNUNG: Obsolete content type encountered: 'application/x-javascript'

Trying it on a website called "csgolounge.com", there are even more:

    Okt 22, 2015 10:32:46 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNUNG: Obsolete content type encountered: 'text/javascript'.
Okt 22, 2015 10:32:47 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNUNG: Obsolete content type encountered: 'application/x-javascript'.
Okt 22, 2015 10:32:47 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SCHWERWIEGEND: runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[http://csgolounge.com/script/jquery.min.js?1423740933] line=[2] lineSource=[null] lineOffset=[0]
Okt 22, 2015 10:32:47 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNUNG: Obsolete content type encountered: 'application/x-javascript'.
Okt 22, 2015 10:32:48 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNUNG: Obsolete content type encountered: 'text/javascript'.
Exception in thread "main" ======= EXCEPTION START ========
Exception class=[net.sourceforge.htmlunit.corejs.javascript.JavaScriptException]
com.gargoylesoftware.htmlunit.ScriptException: TagError: adsbygoogle.push() error: No slot size for availableWidth=0 (http://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js#4)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:865)
    at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:747)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:722)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptIfPossible(HtmlPage.java:945)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.executeInlineScriptIfNeeded(HtmlScript.java:351)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:411)
    at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:270)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:290)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:800)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:757)
    at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170)
    at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072)
    at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206)
    at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3126)
    at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2093)
    at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
    at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:1040)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:253)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:199)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:272)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:160)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:476)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:350)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:415)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:400)
    at Internet.Converter2.htmlToString(Converter2.java:13)
    at main.mainMethod.main(mainMethod.java:8)
Caused by: net.sourceforge.htmlunit.corejs.javascript.JavaScriptException: [object Object] (http://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js#4)
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1006)
    at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798)
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)
    at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:411)
    at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:310)
    at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3286)
    at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:738)
    at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:850)
    ... 33 more
JavaScriptException value = [object Object]
======= EXCEPTION END ========

I really don't understand what it is trying to tell me. I'm lost. Can anyone help me?

1 answer:

Answer 0 (score: 2)

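Simply fetching the URL does not execute any JavaScript. JavaScript is run by the browser, not by the server itself, so a plain download only gives you the source exactly as the server sent it.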
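To make this concrete, here is a minimal sketch using only the JDK. It starts a throwaway local HTTP server, serves a small page whose script would write some text in a browser, and downloads that page the same way the first method in the question does. The class name `RawFetchDemo`, the `PAGE` constant and the helper `fetchFromLocalServer` are made up for this illustration, and it assumes Java 9 or later (for `readAllBytes`).

```java
import com.sun.net.httpserver.HttpServer;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RawFetchDemo {

    // Hypothetical test page: in a browser the script would insert the text
    // "generated by script" into the document.
    static final String PAGE =
        "<html><body><script>document.write('generated ' + 'by script');</script></body></html>";

    /** Serves the given HTML from a local server, downloads it once and returns what was received. */
    static String fetchFromLocalServer(String html) {
        byte[] body = html.getBytes(StandardCharsets.UTF_8);
        HttpServer server = null;
        try {
            // Port 0 lets the operating system pick a free port.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                exchange.getResponseHeaders().add("Content-Type", "text/html; charset=utf-8");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();

            URL url = URI.create("http://127.0.0.1:" + server.getAddress().getPort() + "/").toURL();
            // A plain download: no HTML parsing, no script engine involved.
            try (InputStream in = url.openConnection(Proxy.NO_PROXY).getInputStream()) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (server != null) {
                server.stop(0);
            }
        }
    }

    public static void main(String[] args) {
        String fetched = fetchFromLocalServer(PAGE);
        // The script tag arrives untouched ...
        System.out.println(fetched.contains("<script>"));            // prints "true"
        // ... and the text the script would have produced is not there.
        System.out.println(fetched.contains("generated by script")); // prints "false"
    }
}
```

The downloaded string is byte for byte what the server sent. To get the page as it looks after the scripts have run, something on the client side has to execute them, which is the job a browser, or a headless one such as the HtmlUnit attempt in the question, takes on.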