我使用jsoup来清理html页面,问题是当我在本地保存html时,图片不会显示,因为它们都是相对链接。
以下是一些示例代码:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class so2 {
public static void main(String[] args) {
String html = "<html><head><title>The Title</title></head>"
+ "<body><p><a href=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" target=\"_blank\"><img width=\"437\" src=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" height=\"418\" class=\"documentimage\"></a></p></body></html>";
Document doc = Jsoup.parse(html,"https://whatever.com"); // baseUri seems to be ignored??
System.out.println(doc);
}
}
输出:
<html>
<head>
<title>The Title</title>
</head>
<body>
<p><a href="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif" target="_blank"><img width="437" src="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif" height="418" class="documentimage"></a></p>
</body>
</html>
输出仍然显示链接为a href="/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif"
。
我希望将它们转换为a href="http://whatever.com/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif"
有谁能告诉我如何让jsoup将所有链接转换为绝对链接?
答案 0 :(得分:5)
您可以使用Element.absUrl()
代码中的示例:
编辑(添加图像处理)
public static void main(String[] args) {
String html = "<html><head><title>The Title</title></head>"
+ "<body><p><a href=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" target=\"_blank\"><img width=\"437\" src=\"/data/abstract/ash/2014/5/9/Paper_69295_abstract_120490_0.gif\" height=\"418\" class=\"documentimage\"></a></p></body></html>";
Document doc = Jsoup.parse(html,"https://whatever.com");
Elements select = doc.select("a");
for (Element e : select){
// baseUri will be used by absUrl
String absUrl = e.absUrl("href");
e.attr("href", absUrl);
}
//now we process the imgs
select = doc.select("img");
for (Element e : select){
e.attr("src", e.absUrl("src"));
}
System.out.println(doc);
}