Jsoup.connect(url).get() returns only half of the code

Time: 2015-03-29 04:36:03

Tags: java android html parsing jsoup

I have the following code:

String url="http://www.fastvturesults.com/check_new_results/1rn12ec187";
Document doc=Jsoup.connect(url).get();
Log.i("DATA", doc.toString());

My logcat output:

I/DATA﹕ <!DOCTYPE html>
<html lang="en">
<head>
<meta name="robots" content="noindex">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta property="og:site_name" content="Fast VTU Results - VTU Students Online Community">
<meta property="og:type" content="article">
<meta property="og:title" content="NISHANTH O(1RN12EC187)">
<meta property="og:description" content="NISHANTH O (1RN12EC187)">
<meta name="author" content="Harish">
<meta http-equiv="content-type" content="text/html;charset=UTF-8">
<script type="text/javascript">
//<![CDATA[
try{if (!window.CloudFlare) {var CloudFlare=[{verbose:0,p:0,byc:0,owlid:"cf",bag2:1,mirage2:0,oracle:0,paths:{cloudflare:"/cdn-cgi/nexp/dok3v=1613a3a185/"},atok:"495d5c7bbce19cd697869e6932b33c4a",petok:"1da02c85fa35bc2e676b85c137d245a01ea1bafe-1427603478-1800",zone:"fastvturesults.com",rocket:"0",apps:{"abetterbrowser":{"ie":"6"}}}];!function(a,b){a=document.createElement("script"),b=document.getElementsByTagName("script")[0],a.async=!0,a.src="//ajax.cloudflare.com/cdn-cgi/nexp/dok3v=919620257c/cloudflare.min.js",b.parentNode.insertBefore(a,b)}()}}catch(e){};
//]]>
</script>
<link rel="shortcut icon" href="http://www.fastvturesults.com/ico/favicon.ico">
<!-- HTML5 shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="assets/js/html5shiv.js"></script>
<script src="assets/js/respond.min.js"></script>
<![endif]-->
<link rel="stylesheet" type="text/css" href="http://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/3.0.3/css/bootstrap.min.css">
<style>
a{
color: #B94A48;
text-decoration: none;
}
.box-red-round{
background-color: #ffffff;
}
#fbPopup{
margin-top: 10%;
}
.navbar-custom {
background-color: #B94A48;
color: #ffffff;
border-radius: 0;
}
.navbar-custom .navbar-nav>li>a {
color: #fff;
}
.navbar-custom .navbar-nav>.active>a
{
color: #ffffff;
background-color: #000000;
}
.navbar-custom .navbar-nav>.active>a:hover,.navbar-custom .navbar-nav>.active>a:focus,.navbar-nav>li:hover,.navbar-nav>li:focus
{
color: #ffffff;
background-color: #000000;
}
.navbar-custom .navbar-brand {
color: #ffffff;
}
.blog-post-image{
float: left !important;
margin: 20px 20px;
}
.mini-nav-div{
background-color: #B94A48;
color: #ffffff;
}
.mini-nav-div a{
color: #ffffff;
}
</style>
<script type="text/javascript">
var jq = document.createElement('script');
jq.type = 'text/javascript';
jq.async = true;
jq.src = '//cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js';
var s = document.getElementsByTagName('script')[0];
s.parentNode.insertBefore(jq, s);
</script>
<title>NISHANTH O(1RN12EC187)</title>
<meta name="description" content="NISHANTH O (1RN12EC187)">
<meta name="keywords" content="NISHANTH O results, NISHANTH O class rank, NISHANTH O university rank,1RN12EC187 results, 1RN12EC187 class rank, 1RN12EC187 university rank">
<script type="text/javascript">
var gb = document.createElement('script');
gb.type = 'text/javascript';
gb.async = true;
gb.src = ('https:' == document.location.protocol

Judging by the page's actual source, "document.location.protocol" (the last line of the logcat output) is not even halfway through the source.

Why does the get() method return only the first few lines of the web page's source?

2 answers:

Answer 0 (score: 1)

This is not a Jsoup problem. I am not familiar with logcat, but the first question mark in the HTML code appears at exactly this position:

document.location.protocol ? 'https://ssl'

So my guess is that there is some escaping problem in your logging workflow, rather than in the fetch itself.
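One way to check whether the cut-off comes from logging rather than from Jsoup is to log the document in smaller pieces. Logcat is commonly reported to cap a single log entry at roughly 4 KB, so a long string passed to one Log.i call may simply be truncated. The following is a minimal sketch under that assumption. LogChunker and its split method are hypothetical names introduced here for illustration (they are not part of Jsoup or Android), and the chunk size is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Jsoup or Android): splits a long string
// into pieces small enough to be logged one at a time.
class LogChunker {

    // Returns consecutive substrings of at most chunkSize characters each.
    static List<String> split(String text, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> chunks = new ArrayList<>();
        for (int start = 0; start < text.length(); start += chunkSize) {
            int end = Math.min(text.length(), start + chunkSize);
            chunks.add(text.substring(start, end));
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Stand-in for doc.toString(); on Android, replace System.out.println
        // with Log.i("DATA", chunk) and use a larger chunk size (e.g. 3000).
        String html = "<html><head><title>demo</title></head><body>demo</body></html>";
        for (String chunk : split(html, 16)) {
            System.out.println(chunk);
        }
    }
}
```

If the chunked output contains the whole page, the document was fetched completely and only the single log entry was being cut short. Comparing doc.toString().length() against the size of the page source is another quick check.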

By the way, to avoid an HTTP 403 error, I had to set a fake user agent in order to fetch this URL with Jsoup:

Document doc = Jsoup.connect(url).userAgent("Mozilla/5.0").get();

Answer 1 (score: 0)

I ran into the same problem: Jsoup.parse was missing some of the content. After I added a user agent, the problem was solved.