How do I dump an HTML document without relative URLs?

Time: 2015-12-13 19:51:46

Tags: url jsoup absolute

Thanks to stackoverflow.com, I have this:

Document doc = Jsoup.connect(urlFromUser).userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36").timeout(0).get();

doc.absUrl(urlFromUser);
doc.setBaseUri(urlFromUser);

Elements elements = doc.select("body");
Elements imgElements = doc.select("img");

for (Element element : imgElements) {
    element.attr("src", element.attr("abs:src"));
}

Elements hrefElements = doc.select("a");
for (Element element : hrefElements) {
    element.attr("href", "http://www.some.com/translit/lat2cyr?" + element.attr("abs:href"));
}

Elements linkElements = doc.head().select("link");
for (Element element : linkElements) {
    element.attr("href", element.attr("abs:href"));

    writer.print("");
    manipulateElements(elements);
}

The result is:

<link rel="stylesheet" href="css/windows/windows.css?">

But I need this:

<link rel="stylesheet" href="http://DOMAIN.com/css/windows/windows.css?">

I tried the following, but it did not solve the problem:

String host = uri.getHost();
host = "http://" + host;

writer.print(doc.toString().replaceAll("href=\"/css/", "href=\"" + host + "/css/").replaceAll("/jscript/", host + "/jscript/").replaceAll("/styles/", host + "/styles/").replaceAll("/functions/", host + "/functions/").replaceAll("href=\"/templates/", host + "/templates/").replaceAll("href=\"/plugins/", host + "/plugins/").replaceAll("href=\"css/", "href=\"" + host + "/css/"));
writer.close();

1 answer:

Answer 0 (score: 1)

To achieve this directly, you would need to customize a JSoup class so that it generates absolute URLs instead of relative ones. Unfortunately, as of JSoup 1.8.3, this class is internal.

You could try writing a custom NodeVisitor implementation, but that would be too much work.

On the other hand, here is a workaround:

// Fetch the document. JSoup will set the baseUri for us automatically...
Document doc = Jsoup //
        .connect(urlFromUser) //
        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36") //
        .timeout(0) //
        .get();

// Turn any url into an absolute url
String myTargetedTags = "img, a, link";
for (Element e : doc.select(myTargetedTags)) {
    switch (e.tagName().toLowerCase()) {
    case "img":
        e.attr("src", e.absUrl("src"));
        break;
    case "a":
        e.attr("href", "http://www.some.com/translit/lat2cyr?" + e.absUrl("href"));
        break;
    case "link":
        e.attr("href", e.absUrl("href"));
        break;
    default:
        throw new RuntimeException("Unexpected element:\n" + e.outerHtml());
    }
}

// Print out the final result
writer.print(doc.outerHtml());
writer.flush(); // Just to be sure that everything goes out...
writer.close();

Note: I don't know how this code performs on large documents.

Sample code

// Parse a small test document, passing the baseUri explicitly
Document doc = Jsoup
            .parse( //
               "<html><head><link rel=\"stylesheet\" type=\"text/css\" href=\"/css/main.css\"></head><body><img src=\"img/my-image.jpg\"><a href=\"/page/page.html\">an anchor</a></body></html>", //
               "http://localhost");
System.out.println("** BEFORE **\n" + doc.outerHtml());

// Turn any url into an absolute url
// (same lines as above...)

// Print out the final result
System.out.println("\n** AFTER **\n" + doc.outerHtml());

Output

** BEFORE **
<html>
 <head>
  <link rel="stylesheet" type="text/css" href="/css/main.css">
 </head>
 <body>
  <img src="img/my-image.jpg">
  <a href="/page/page.html">an anchor</a>
 </body>
</html>

** AFTER **
<html>
 <head>
  <link rel="stylesheet" type="text/css" href="http://localhost/css/main.css">
 </head>
 <body>
  <img src="http://localhost/img/my-image.jpg">
  <a href="http://www.some.com/translit/lat2cyr?http://localhost/page/page.html">an anchor</a>
 </body>
</html>


Tested on JSoup 1.8.3
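For readers who want to see what resolving a relative URL against a base URI amounts to, here is a minimal sketch using only the JDK's java.net.URI. It is not JSoup's implementation, and the class name AbsoluteUrlSketch and the helper toAbsolute are hypothetical names chosen for this illustration. The base is written as "http://localhost/" with a trailing slash, on the assumption that java.net.URI may not insert the separating slash when the base has an empty path.

```java
import java.net.URI;

// Hypothetical illustration class, not part of JSoup.
final class AbsoluteUrlSketch {

    // Resolves a possibly relative reference against a base URI.
    // Returns an empty string if either argument cannot be parsed as a URI.
    static String toAbsolute(String baseUri, String reference) {
        try {
            return URI.create(baseUri).resolve(reference.trim()).toString();
        } catch (IllegalArgumentException e) {
            return "";
        }
    }

    public static void main(String[] args) {
        // Assumed base; note the trailing slash.
        String base = "http://localhost/";

        // Root-relative reference: replaces the whole path of the base.
        System.out.println(toAbsolute(base, "/css/main.css"));
        // Path-relative reference: appended to the base's directory.
        System.out.println(toAbsolute(base, "img/my-image.jpg"));
        // Already absolute reference: returned unchanged.
        System.out.println(toAbsolute(base, "http://example.com/a.png"));
    }
}
```

The three calls mirror the link, img and a cases from the sample document above. In the workaround itself, e.absUrl(...) does this job and takes the base from the document's baseUri.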