I am trying to parse an HTML page I downloaded using jsoup. I have code that reads the page and returns the HTML output as a string, which I then save to a file through a byte stream. The problem comes when I try to parse that saved html page to retrieve its title: all I can get back is markup with empty content, like this:
<html>
<head></head>
<body>
cores/core/log/DJade_Tmp/tmp__11755159/tmp_yh/tmp_content/998159649.html
</body>
</html>
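For what it's worth, a body consisting of nothing but a file path is the classic symptom of handing the parser the *path string* instead of the file's contents: with no tags in the input, jsoup wraps the path in a skeleton document with an empty head. A minimal stdlib sketch of the distinction, using a hypothetical temp file (the jsoup calls are shown only in comments, under the assumption that is what happens on the parse side):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SavedPageCheck {
    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for a page saved to tmp.
        Path saved = Files.createTempFile("tmp_content", ".html");
        Files.write(saved, ("<!doctype html><html><head>"
                + "<title>Java Swing Tutorials</title></head><body></body></html>")
                .getBytes(StandardCharsets.UTF_8));

        // Parsing the PATH STRING, e.g. Jsoup.parse(saved.toString()), makes jsoup
        // treat the path itself as an HTML fragment: it contains no tags, so jsoup
        // wraps it as <html><head></head><body>the/path</body></html>.

        // Reading the file's CONTENT first gives the parser real markup to work on:
        String html = new String(Files.readAllBytes(saved), StandardCharsets.UTF_8);
        System.out.println(html.contains("<title>Java Swing Tutorials</title>"));
        // Equivalent one-step form in jsoup: Jsoup.parse(saved.toFile(), "UTF-8")
    }
}
```

This prints `true`: the title survives the round trip when the content, rather than the path, is what gets parsed.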
Here is the content of the html page:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="/d2/bs/css/bootstrap.min.css" rel="stylesheet">
<title>Java Swing Tutorials</title>
<meta content="Java Swing Programming Tutorials - Online Java tutorials provides swing tutorials, swing example and code, swing example programs, definition of java swing, free java swing tutorials. Also read useful java articles and resouces on java and advanced java programming." name="description">
<meta content="java swing, java swing tutorials, swing example java, online swing tutorials, swing example code java, free java code, java tutorials, online java tutorials, free java tutorials" name="keywords">
<link href="/d1/prettify/prettify.css" type="text/css" rel="stylesheet">
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-6980313-1', 'auto');
ga('send', 'pageview');
</script>
<script async src="//pagead2.googlesyndication.com/pagead/js/adsbygoogle.js"></script>
<script>
(adsbygoogle = window.adsbygoogle || []).push({
google_ad_client: "ca-pub-0714075272818912",
enable_page_level_ads: true
});
</script>
.......
I can't paste the entire content of the page here because it would be too long. As you can see, the title is right there, along with the rest of the page, yet all I get back are empty tags.
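As a sanity check that the title really is recoverable from the saved text, a plain regex over the string read back from the file will find it (crude, illustration only; once the content itself is parsed, jsoup's `doc.title()` is the proper way):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleCheck {
    public static void main(String[] args) {
        // Stand-in for the string read back from the saved html file.
        String html = "<!doctype html><html lang=\"en\"><head><meta charset=\"utf-8\">"
                + "<title>Java Swing Tutorials</title></head><body></body></html>";
        // Crude <title> extraction, just to confirm the text is really there;
        // DOTALL in case the element spans several lines.
        Matcher m = Pattern.compile("<title>(.*?)</title>", Pattern.DOTALL).matcher(html);
        if (m.find()) {
            System.out.println(m.group(1).trim());
        }
    }
}
```

If this prints the expected `Java Swing Tutorials` against the real saved file but jsoup still returns an empty head, the problem is in what gets handed to the parser, not in the saved data.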
Here is the code I use to save the pages:
/************ HERE WE DETERMINE WHETHER TO DOWNLOAD WITH JSOUP OR HTTPURLCONNECTION **********/
if (type.equals("jsd")) {
    // HERE WE DOWNLOAD USING JSOUP
    pageStrings = new String[1][Contents.length];
    // Loop to download each individual url string
    for (int j = 0; j < Urls.length; j++) {
        if (Urls[j].toString().indexOf(".pdf") >= 0 || Urls[j].toString().indexOf(".xml") >= 0) {
            pdfList.add(Urls[j].toString());
        } else {
            String sUrl = "";
            Document doc = null;
            Connection con = Jsoup.connect(Urls[j].toString())
                    .timeout(50000)
                    .ignoreHttpErrors(true)
                    .followRedirects(true)
                    .userAgent(userAgent);
            Connection.Response resp = con.execute();
            // HERE WE CHECK THE RESPONSE CODE
            if (resp.statusCode() == 200) {
                doc = con.get();
                // Now get the text of the document
                pageStrings[0][j] = doc.html();
                urlList.add(Urls[j].toString());
            } else if (resp.statusCode() == 307) {
                String sNewUrl = resp.header("Location");
                // Only follow the redirect when a usable Location header exists;
                // without these braces the original code connected even when
                // sUrl was still "", and Jsoup.connect("") throws.
                if (sNewUrl != null && sNewUrl.length() > 7) {
                    sUrl = sNewUrl;
                    resp = Jsoup.connect(sUrl)
                            .timeout(50000)
                            .ignoreHttpErrors(true)
                            .userAgent(userAgent)
                            .execute();
                    doc = resp.parse();
                    // Now get the text of the document
                    pageStrings[0][j] = doc.html();
                    urlList.add(Urls[j].toString());
                }
            } // End of status check
        }
    } // End of loop
    // HERE WE CREATE THE META DATA
    // Compose the lists into arrays
    pdfUrlType = pdfList.toArray(new String[pdfList.size()]);   // Composed pdf url type
    xhtmlTypeUrl = urlList.toArray(new String[urlList.size()]); // Composed html url type
    xhtmlUrl = new URL[xhtmlTypeUrl.length];
    // Loop to turn the strings into URLs
    for (int o = 0; o < xhtmlTypeUrl.length; o++) {
        xhtmlUrl[o] = new URL(xhtmlTypeUrl[o]);
    } // End of loop
    // NOW CREATE THE META
    meta = new MetaWizard(xhtmlUrl, null);
    // NOW SAVE THE RETURNED HTML TO TMP
    if (pageStrings[0].length > 0) {
        downloader = new DjadeTmp(pageStrings[0], "html", "[n][" + meta.metaWriter() + "]");
        FileDatas = downloader.save(null);
    } // End of page string check for html
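One more thing worth checking in `DjadeTmp.save` (not shown here): when the html string goes out through a byte stream, the charset has to be pinned on both the write and the later read, otherwise the saved page can come back garbled on platforms whose default charset is not UTF-8. A stdlib round-trip sketch under that assumption, with hypothetical names:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class RoundTrip {
    public static void main(String[] args) throws IOException {
        // Stand-in for one entry of pageStrings[0].
        String pageString = "<html><head><title>Java Swing Tutorials</title></head></html>";
        Path tmp = Files.createTempFile("djade", ".html");
        // Write with an explicit charset rather than the platform default ...
        Files.write(tmp, pageString.getBytes(StandardCharsets.UTF_8));
        // ... and read back with the SAME charset before parsing.
        String back = new String(Files.readAllBytes(tmp), StandardCharsets.UTF_8);
        System.out.println(back.equals(pageString));
    }
}
```

This prints `true`; if the equivalent check fails against the real save/read path, the byte-stream code is altering the content before jsoup ever sees it.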
What could be the reason the page fails to parse? Any help would be greatly appreciated. I am running the project from Eclipse.