I recently figured out how to fetch HTML code with Java.
So I wrote the following method:
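//imports needed by the enclosing class
import java.io.IOException;
import java.net.URL;
import java.util.Scanner;
public String htmlToString(String urlString){
//returns the html code of the given website as a string
//if something doesn't work "fail" is returned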
try {
//convert String to URL
URL url = new URL(urlString);
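//read URL by Scanner; the "\\A" delimiter turns the whole stream into one token,
//so whitespace between words and tags is kept (next() per token would drop it)
Scanner s = new Scanner(url.openStream()).useDelimiter("\\A");
//hasNext() is false for an empty response, so fall back to an empty string
String read = s.hasNext() ? s.next() : "";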
s.close();
return read;
}
catch(IOException iOEx) {
// there was some connection problem, or the file did not exist on the server,
// or your URL was not in the right format.
// think about what to do now, and put it here.
iOEx.printStackTrace(); // for now, simply output it.
return "fail";
}catch(java.util.NoSuchElementException elEX){
//couldn't find a next token
//similar problems as described before
elEX.printStackTrace();
return "fail";
}
}
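My problem is that the pages I'm looking at contain a lot of JavaScript. I need the HTML as it is after those scripts have run, i.e. what you would see if you opened the page in a browser and then viewed the source.
Is there a way to get that code?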
=============================================== ===============================
Edit: I have now tried HtmlUnit, which I had never used before, and came up with this code:
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.io.IOException;
public class Converter2 {
public String htmlToString(String url){
try{
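// WebClient is HtmlUnit's headless browser; getPage loads the URL and runs its scripts
WebClient webClient = new WebClient();
HtmlPage page = webClient.getPage(url);
// asText() returns the visible text of the page, not its markup
String pageAsText = page.asText();
webClient.close();
return pageAsText;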
}catch(IOException ioEx){
return "fail";
}
}
}
When I run it I get a lot of errors. With Amazon I get these:
WARNUNG: CSS error: 'http://z-ecx.images-amazon.com/images/G/01/AUIClients/AmazonUI-2215197d18a3d0e321eb1a67a8b9e87ba4b4ab20._V2_.css#AUIClients/AmazonUI.rendering_engine-trident.min' [1:125781] Error in declaration. '*' ist als erstes Zeichen einer Property nicht erlaubt.
Okt 22, 2015 10:23:38 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNUNG: CSS error: 'http://z-ecx.images-amazon.com/images/G/01/AUIClients/AmazonUI-2215197d18a3d0e321eb1a67a8b9e87ba4b4ab20._V2_.css#AUIClients/AmazonUI.rendering_engine-trident.min' [1:125797] Error in declaration. '*' ist als erstes Zeichen einer Property nicht erlaubt.
Okt 22, 2015 10:23:38 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNUNG: CSS error: 'http://z-ecx.images-amazon.com/images/G/01/AUIClients/AmazonGatewayAuiAssets-3d5b6f366e05fa5c0b2f38dca7366948b0599a7b._V2_.css#AUIClients/AmazonGatewayAuiAssets.weblab-GW_NOT_INTERESTED_48787-C.min' [1:8806] Fehler in Ausdruck; ':' nach dem identifier "progid" gefunden.
Okt 22, 2015 10:23:38 PM com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
WARNUNG: CSS error: 'http://z-ecx.images-amazon.com/images/G/01/AUIClients/AmazonGatewayAuiAssets-3d5b6f366e05fa5c0b2f38dca7366948b0599a7b._V2_.css#AUIClients/AmazonGatewayAuiAssets.weblab-GW_NOT_INTERESTED_48787-C.min' [1:8942] Fehler in Ausdruck; ':' nach dem identifier "progid" gefunden.
Okt 22, 2015 10:23:38 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNUNG: Obsolete content type encountered: 'application/x-javascript'.
Okt 22, 2015 10:23:38 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNUNG: Obsolete content type encountered: 'application/x-javascript'
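A side note on the output above: WARNUNG is the German-locale name of the WARNING log level, so these lines are log messages about Amazon's CSS and content types rather than failures of the program. Assuming HtmlUnit logs through java.util.logging, as the timestamp and level format of these lines suggests, a sketch like the following would hide them. The class name QuietHtmlUnitLogs and the method silence() are made up for illustration.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietHtmlUnitLogs {
    // Keep a strong reference: java.util.logging holds loggers weakly, so a
    // configured logger that nobody references may be garbage collected.
    private static final Logger HTMLUNIT_LOGGER =
        Logger.getLogger("com.gargoylesoftware.htmlunit");

    // Turns off the package logger; child loggers without a level of their
    // own inherit this setting.
    public static void silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silence();
        // Same logger name as in the CSS warnings of the question.
        Logger css = Logger.getLogger(
            "com.gargoylesoftware.htmlunit.DefaultCssErrorHandler");
        System.out.println(css.isLoggable(Level.WARNING)); // false
        css.warning("CSS error: this line is not printed");
    }
}
```

Calling silence() once before creating the WebClient would be the natural place for it. It only hides log output; an exception thrown by a script, like the one from csgolounge.com, is unaffected.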
With a site called "csgolounge.com" there are even more:
Okt 22, 2015 10:32:46 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNUNG: Obsolete content type encountered: 'text/javascript'.
Okt 22, 2015 10:32:47 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNUNG: Obsolete content type encountered: 'application/x-javascript'.
Okt 22, 2015 10:32:47 PM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SCHWERWIEGEND: runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[http://csgolounge.com/script/jquery.min.js?1423740933] line=[2] lineSource=[null] lineOffset=[0]
Okt 22, 2015 10:32:47 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNUNG: Obsolete content type encountered: 'application/x-javascript'.
Okt 22, 2015 10:32:48 PM com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
WARNUNG: Obsolete content type encountered: 'text/javascript'.
Exception in thread "main" ======= EXCEPTION START ========
Exception class=[net.sourceforge.htmlunit.corejs.javascript.JavaScriptException]
com.gargoylesoftware.htmlunit.ScriptException: TagError: adsbygoogle.push() error: No slot size for availableWidth=0 (http://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js#4)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:865)
at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:513)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:747)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:722)
at com.gargoylesoftware.htmlunit.html.HtmlPage.executeJavaScriptIfPossible(HtmlPage.java:945)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeInlineScriptIfNeeded(HtmlScript.java:351)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:411)
at com.gargoylesoftware.htmlunit.html.HtmlScript$3.execute(HtmlScript.java:270)
at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:290)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:800)
at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:757)
at org.cyberneko.html.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1170)
at org.cyberneko.html.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1072)
at org.cyberneko.html.filters.DefaultFilter.endElement(DefaultFilter.java:206)
at org.cyberneko.html.filters.NamespaceBinder.endElement(NamespaceBinder.java:330)
at org.cyberneko.html.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3126)
at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2093)
at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:920)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:499)
at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:452)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:1040)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:253)
at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:199)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:272)
at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:160)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:476)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:350)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:415)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:400)
at Internet.Converter2.htmlToString(Converter2.java:13)
at main.mainMethod.main(mainMethod.java:8)
Caused by: net.sourceforge.htmlunit.corejs.javascript.JavaScriptException: [object Object] (http://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js#4)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1006)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:798)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:411)
at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:310)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3286)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:738)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:850)
... 33 more
JavaScriptException value = [object Object]
======= EXCEPTION END ========
I really don't understand what it is trying to tell me. I'm lost. Can anyone help me?
Answer 0 (score: 2)
Simply fetching the URL does not execute any JavaScript. JavaScript is run by the browser, not by the server itself.
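To make that concrete, here is a minimal sketch using only the JDK. It assumes a throwaway local server as a stand-in for the real website; the class RawFetchDemo, the constant PAGE, and the methods startServer and fetch are hypothetical names chosen for this example. The page contains a script that a browser would use to replace the text "loading", but a plain fetch hands back exactly the bytes the server sent, script source included.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class RawFetchDemo {
    // Hypothetical test page: in a browser the script would replace "loading".
    public static final String PAGE =
        "<html><body><div id=\"out\">loading</div>"
        + "<script>document.getElementById('out').textContent = 'rendered';</script>"
        + "</body></html>";

    // Starts a local server on a free port that answers every request with PAGE.
    public static HttpServer startServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            byte[] body = PAGE.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "text/html; charset=utf-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
        return server;
    }

    // Plain fetch, as in the question: returns what the server sent, nothing more.
    public static String fetch(String urlString) throws IOException {
        try (InputStream in = URI.create(urlString).toURL().openStream();
             Scanner s = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return s.hasNext() ? s.next() : "";
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startServer();
        try {
            String html = fetch("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            System.out.println(html.contains("<script>"));  // true: script arrives as text
            System.out.println(html.contains(">loading<")); // true: placeholder unchanged
        } finally {
            server.stop(0);
        }
    }
}
```

Getting the post-script HTML therefore needs something that actually runs the scripts, which is what a headless browser library such as HtmlUnit from the question's edit is for. For pages whose scripts throw, like the adsbygoogle error in the stack trace, HtmlUnit's WebClientOptions has a setThrowExceptionOnScriptError(boolean) setting that may be worth trying.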