Can I get the URL of a page I downloaded with jsoup?

Date: 2015-03-08 16:28:19

Tags: java html url download jsoup

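Is there a way to get the URL of a web page I have just downloaded? I don't mean the links contained in the HTML page, but the URL of the HTML page itself.

I tried this: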

org.jsoup.nodes.Document doc = Jsoup.parse(child, "UTF-8", "");
String url = doc.location();
System.out.println(url);

But url comes back as an empty string.

2 answers:

Answer 0 (score: 1)

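Assuming the page you downloaded is a Document, just call Document.location() to get the URL it was served from. If the URL you passed to Jsoup.connect() is a redirect, the Document's location gives you the final URL that was actually served.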
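For a page that is parsed from a file or a string rather than fetched, location() simply reports the base URI that was passed to Jsoup.parse(), so an empty base URI gives an empty location. Here is a minimal sketch of both cases, assuming jsoup is on the classpath; the class name LocationDemo, the helper locationOf, and the example.com address are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LocationDemo {

    // Parses HTML held in memory and returns the location jsoup recorded for it.
    // For parsed (not fetched) documents this is the base URI passed to parse().
    static String locationOf(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        return doc.location();
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>hello</p></body></html>";

        // Base URI supplied: location() returns it unchanged.
        System.out.println(locationOf(html, "http://example.com/page.html"));

        // Empty base URI, as in the question: location() is empty too.
        System.out.println(locationOf(html, "").isEmpty()); // prints "true"

        // Optional network case: pass a URL as the first argument.
        // After redirects, location() is the URL the page was finally served from.
        if (args.length > 0) {
            Document fetched = Jsoup.connect(args[0]).get();
            System.out.println(fetched.location());
        }
    }
}
```

So if the original URL is known when the file is saved, passing it as the third argument to Jsoup.parse() is enough for location() to return it later.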

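Answer 1 (score: 0)

If you use WinHTTrack, the URL is usually saved, but what you can do is look for a PHP or JavaScript file that is linked by its full site URL. For example, the downloaded site below contains several links: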

<html lang="en-US">

<!-- Mirrored from brigade3.com/ by HTTrack Website Copier/3.x [XR&CO'2014], Sat, 13 Dec 2014 04:02:28 GMT -->
<!-- Added by HTTrack --><meta http-equiv="content-type" content="text/html;charset=UTF-8" /><!-- /Added by HTTrack -->
<head>
    <meta charset="UTF-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
        <meta name=viewport content="width=device-width,initial-scale=1">
        <title>BukkitCloud | Beta Stage</title>
                    <link rel="profile" href="http://gmpg.org/xfn/11" />
    <link rel="pingback" href="xmlrpc.php" />
    <link rel="shortcut icon" type="image/x-icon" href="assets/uploads/2013/12/favico.png">
    <link rel='stylesheet' href='http://fonts.googleapis.com/css?family=Open+Sans:400,800,700italic,700,600italic,600,400italic,300italic,300|Source+Sans+Pro:200,300,400|Lato&amp;subset=latin,latin-ext' type='text/css' />
<link rel="alternate" type="application/rss+xml" title="Brigade &raquo; Feed" href="feed/index.html" />
<link rel="alternate" type="application/rss+xml" title="Brigade &raquo; Comments Feed" href="comments/feed/index.html" />
<link rel='stylesheet' id='rs-settings-css'  href='assets/plugins/revslider/rs-plugin/css/settings.css' type='text/css' media='all' />
<link rel='stylesheet' id='rs-captions-css'  href='assets/plugins/revslider/rs-plugin/css/captions.css' type='text/css' media='all' />
<link rel='stylesheet' id='default_style-css'  href='assets/themes/passage/style.css' type='text/css' media='all' />
<link rel='stylesheet' id='stylesheet-css'  href='assets/themes/passage/css/stylesheet.min.css' type='text/css' media='all' />
<!--[if IE 8]>
<link rel='stylesheet' id='ie8-style-css'  href='http://brigade3.com/assets/themes/passage/css/ie8.min.css' type='text/css' media='all' />
<![endif]-->
<!--[if IE 9]>
<link rel='stylesheet' id='ie9-style-css'  href='http://brigade3.com/assets/themes/passage/css/ie9.min.css' type='text/css' media='all' />
<![endif]-->
<link rel='stylesheet' id='style_dynamic-css'  href='assets/themes/passage/css/style_dynamic.css' type='text/css' media='all' />
<link rel='stylesheet' id='responsive-css'  href='assets/themes/passage/css/responsive.min.css' type='text/css' media='all' />
<link rel='stylesheet' id='style_dynamic_responsive-css'  href='assets/themes/passage/css/style_dynamic_responsive.css' type='text/css' media='all' />
<link rel='stylesheet' id='custom_css-css'  href='assets/themes/passage/css/custom_css.css' type='text/css' media='all' />
<script type='text/javascript' src='http://brigade3.com/wp-includes/js/jquery/jquery.js'></script>
<script type='text/javascript' src='http://brigade3.com/wp-includes/js/jquery/jquery-migrate.min.js'></script>
<script type='text/javascript' src='assets/plugins/revslider/rs-plugin/js/jquery.themepunch.revolution.min.js'></script>
<link rel='prev' title='FEATURES' href='features/index.html' />
<link rel='next' title='CONTACT' href='contact/index.html' />
<link rel='canonical' href='index.html' />
<link rel='shortlink' href='index.html' />
        <style type="text/css">
            .comments-link {
                display: none;
            }
                    </style>...
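As a side note, the "Mirrored from" comment that HTTrack added near the top of the listing above already records the source address. Here is a rough sketch of reading it back with a regular expression, assuming the comment keeps the format shown above; the class name HttrackSource and the method mirroredFrom are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HttrackSource {

    // Matches "<!-- Mirrored from <address> by HTTrack" and captures <address>.
    private static final Pattern MIRRORED_FROM =
            Pattern.compile("<!--\\s*Mirrored from\\s+(\\S+)\\s+by HTTrack");

    // Returns the address recorded in the HTTrack comment, if the page has one.
    static Optional<String> mirroredFrom(String html) {
        Matcher m = MIRRORED_FROM.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String html = "<html lang=\"en-US\">\n"
                + "<!-- Mirrored from brigade3.com/ by HTTrack Website Copier/3.x"
                + " [XR&CO'2014], Sat, 13 Dec 2014 04:02:28 GMT -->\n"
                + "<head></head></html>";

        // Prints the captured address, here without a scheme: brigade3.com/
        System.out.println(mirroredFrom(html).orElse("(no HTTrack comment found)"));
    }
}
```

The captured value has no scheme (http or https), so it identifies the host and path but is not a complete URL by itself.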

Then, as you can see, you search for any URLs that may be present. In this case, the URL links to a JavaScript file.

<script type='text/javascript' src='http://brigade3.com/wp-includes/js/jquery/jquery.js'></script>
<script type='text/javascript' src='http://brigade3.com/wp-includes/js/jquery/jquery-migrate.min.js'></script>

Then just take http://brigade3.com/wp-includes/js/jquery/jquery.js and shorten it to http://brigade3.com to find the site's URL. I hope this is what you meant!
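The shortening step can also be done in code. This is a small sketch using only java.net.URI from the standard library; the class name SiteRoot and the method siteRoot are made up for illustration. Note that it only recovers the site root, not the URL of the page itself, and it rejects relative links such as assets/themes/passage/style.css because they carry no host.

```java
import java.net.URI;

public class SiteRoot {

    // Reduces an absolute URL to scheme://host, keeping an explicit port if present.
    static String siteRoot(String absoluteUrl) {
        URI uri = URI.create(absoluteUrl);
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Not an absolute URL: " + absoluteUrl);
        }
        String root = uri.getScheme() + "://" + uri.getHost();
        if (uri.getPort() != -1) {
            root += ":" + uri.getPort();
        }
        return root;
    }

    public static void main(String[] args) {
        // Prints: http://brigade3.com
        System.out.println(siteRoot("http://brigade3.com/wp-includes/js/jquery/jquery.js"));
    }
}
```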