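Is there a way to get the URL of a web page I have just downloaded? I don't mean the links contained in the HTML page, but the URL of the HTML page itself.
I tried this: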
org.jsoup.nodes.Document doc = Jsoup.parse(child, "UTF-8", "");
String url = doc.location();
System.out.println(url);
but url comes back as an empty string.
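Answer 0: (score: 1)
Assuming the page you downloaded is a Document, just call Document.location() to get the URL it was served from. If the URL you passed to Jsoup.connect() was a redirect, the Document's location gives you the URL that was finally served.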
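A minimal sketch of the two cases, assuming jsoup is on the classpath. The class and method names (LocationDemo, fetchedLocation, parsedLocation) are made up for illustration. It also suggests why the code in the question prints an empty string: when a document is parsed from local content rather than fetched, location() simply returns the base URI that was handed to Jsoup.parse(), and the question passes "".

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper class, for illustration only.
final class LocationDemo {

    // Fetch over HTTP: location() reports the URL the page was finally
    // served from, i.e. after any redirects have been followed.
    static String fetchedLocation(String url) throws IOException {
        Document doc = Jsoup.connect(url).get();
        return doc.location();
    }

    // Parse HTML that is already downloaded: jsoup cannot know where it
    // came from, so location() echoes the base URI passed in here.
    static String parsedLocation(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.location();
    }
}
```

For the file in the question, the equivalent would be to pass the page's original URL, if it is known, as the third argument of Jsoup.parse(child, "UTF-8", ...) instead of "".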
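Answer 1: (score: 0)
If you used WinHTTrack, it usually saves the URL (see the "Mirrored from" comment near the top of the example below). Otherwise, what you can do is look for a PHP or JavaScript file that is still referenced by the site's full URL. For example, the downloaded site below contains several such links: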
<html lang="en-US">
<!-- Mirrored from brigade3.com/ by HTTrack Website Copier/3.x [XR&CO'2014], Sat, 13 Dec 2014 04:02:28 GMT -->
<!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=UTF-8" /><!-- /Added by HTTrack -->
<head>
<meta charset="UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<meta name=viewport content="width=device-width,initial-scale=1">
<title>BukkitCloud | Beta Stage</title>
<link rel="profile" href="http://gmpg.org/xfn/11" />
<link rel="pingback" href="xmlrpc.php" />
<link rel="shortcut icon" type="image/x-icon" href="assets/uploads/2013/12/favico.png">
<link rel='stylesheet' href='http://fonts.googleapis.com/css?family=Open+Sans:400,800,700italic,700,600italic,600,400italic,300italic,300|Source+Sans+Pro:200,300,400|Lato&subset=latin,latin-ext' type='text/css' />
<link rel="alternate" type="application/rss+xml" title="Brigade » Feed" href="feed/index.html" />
<link rel="alternate" type="application/rss+xml" title="Brigade » Comments Feed" href="comments/feed/index.html" />
<link rel='stylesheet' id='rs-settings-css' href='assets/plugins/revslider/rs-plugin/css/settings.css' type='text/css' media='all' />
<link rel='stylesheet' id='rs-captions-css' href='assets/plugins/revslider/rs-plugin/css/captions.css' type='text/css' media='all' />
<link rel='stylesheet' id='default_style-css' href='assets/themes/passage/style.css' type='text/css' media='all' />
<link rel='stylesheet' id='stylesheet-css' href='assets/themes/passage/css/stylesheet.min.css' type='text/css' media='all' />
<!--[if IE 8]>
<link rel='stylesheet' id='ie8-style-css' href='http://brigade3.com/assets/themes/passage/css/ie8.min.css' type='text/css' media='all' />
<![endif]-->
<!--[if IE 9]>
<link rel='stylesheet' id='ie9-style-css' href='http://brigade3.com/assets/themes/passage/css/ie9.min.css' type='text/css' media='all' />
<![endif]-->
<link rel='stylesheet' id='style_dynamic-css' href='assets/themes/passage/css/style_dynamic.css' type='text/css' media='all' />
<link rel='stylesheet' id='responsive-css' href='assets/themes/passage/css/responsive.min.css' type='text/css' media='all' />
<link rel='stylesheet' id='style_dynamic_responsive-css' href='assets/themes/passage/css/style_dynamic_responsive.css' type='text/css' media='all' />
<link rel='stylesheet' id='custom_css-css' href='assets/themes/passage/css/custom_css.css' type='text/css' media='all' />
<script type='text/javascript' src='http://brigade3.com/wp-includes/js/jquery/jquery.js'></script>
<script type='text/javascript' src='http://brigade3.com/wp-includes/js/jquery/jquery-migrate.min.js'></script>
<script type='text/javascript' src='assets/plugins/revslider/rs-plugin/js/jquery.themepunch.revolution.min.js'></script>
<link rel='prev' title='FEATURES' href='features/index.html' />
<link rel='next' title='CONTACT' href='contact/index.html' />
<link rel='canonical' href='index.html' />
<link rel='shortlink' href='index.html' />
<style type="text/css">
.comments-link {
display: none;
}
</style>...
Then, as you can see, you search for any absolute URLs that are still present. In this case, they point to JavaScript files:
<script type='text/javascript' src='http://brigade3.com/wp-includes/js/jquery/jquery.js'></script>
<script type='text/javascript' src='http://brigade3.com/wp-includes/js/jquery/jquery-migrate.min.js'></script>
Then simply take http://brigade3.com/wp-includes/js/jquery/jquery.js and shorten it to http://brigade3.com to get the site's URL. I hope that's what you meant!
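The manual steps above could be automated roughly as follows. This is only a sketch using the Java standard library. The class and method names (MirrorSource, mirroredFrom, originOf) are invented for illustration, and the regular expression assumes the comment format seen in the sample page ("Mirrored from ... by HTTrack"), which other HTTrack versions may not follow exactly.

```java
import java.net.URI;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class MirrorSource {

    // Matches the comment HTTrack added to the sample page, e.g.
    // <!-- Mirrored from brigade3.com/ by HTTrack Website Copier/3.x ... -->
    private static final Pattern MIRRORED_FROM =
            Pattern.compile("<!--\\s*Mirrored from\\s+(\\S+)\\s+by HTTrack");

    // Returns the address recorded in the HTTrack comment, if one is present.
    static Optional<String> mirroredFrom(String html) {
        Matcher m = MIRRORED_FROM.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Shortens an absolute URL to scheme plus authority (host and optional port).
    static String originOf(String absoluteUrl) {
        URI uri = URI.create(absoluteUrl);
        return uri.getScheme() + "://" + uri.getAuthority();
    }

    public static void main(String[] args) {
        String html = "<html lang=\"en-US\">\n"
                + "<!-- Mirrored from brigade3.com/ by HTTrack Website Copier/3.x -->\n";
        System.out.println(mirroredFrom(html).orElse("(no HTTrack comment)"));
        System.out.println(originOf("http://brigade3.com/wp-includes/js/jquery/jquery.js"));
    }
}
```

Note that not every absolute URL in a page belongs to the site itself: in the sample, the gmpg.org and fonts.googleapis.com links are third-party, so the URL to shorten still has to be picked with some care.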