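I am trying to build a web crawler in Java, and I would like to know whether there is a way to get a relative path from an absolute path, given a base URL. I want to replace every absolute path in the HTML that points to the same domain.
Because HTTP URLs can contain unsafe characters, I cannot use the Java URI approach described in "How to construct a relative path in Java from two absolute paths (or URLs)?".
I am using jsoup to parse my HTML. It seems to be able to get an absolute path from a relative one, but not the other way round.
For example, take the following HTML page: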
"http://www.example.com/mysite/base.html"
The page source of base.html may contain:
<a href="http://www.example.com/myanothersite/new.html">Another site of mine</a>
I am trying to cache this base.html and edit it so that it now contains:
<a href="../myanothersite/new.html">Another site of mine</a>
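For context, here is a minimal sketch of what `java.net.URI.relativize` does with the URLs from the example. It only strips a base that is a prefix of the target and never produces `../` segments, so a sibling directory comes back as the unchanged absolute URL. The class name `UriRelativizeDemo` and the path `sub/new.html` are made up for illustration.

```java
import java.net.URI;

final class UriRelativizeDemo {
    // Returns whatever java.net.URI.relativize produces for the given pair.
    static String relativize(String base, String target) {
        return URI.create(base).relativize(URI.create(target)).toString();
    }

    public static void main(String[] args) {
        // Target lies below the base: the base prefix is stripped.
        System.out.println(relativize(
                "http://www.example.com/mysite/",
                "http://www.example.com/mysite/sub/new.html"));    // sub/new.html

        // Target is in a sibling directory: the absolute URL is returned unchanged.
        System.out.println(relativize(
                "http://www.example.com/mysite/",
                "http://www.example.com/myanothersite/new.html")); // http://www.example.com/myanothersite/new.html
    }
}
```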
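Answer 0 (score: 2)
A different approach that does not need a given baseUrl and uses a slightly more advanced method: split both URLs at "/" and compare them segment by segment.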
    String sourceUrl = "http://www.example.com/mysite/whatever/somefolder/bar/unsecure!+?#whätyöühäv€it/site.html"; // your current site
    String targetUrl = "http://www.example.com/mysite/whatever/otherfolder/other.html"; // the link target
    String expectedTarget = "../../../otherfolder/other.html";

    String[] sourceElements = sourceUrl.split("/");
    String[] targetElements = targetUrl.split("/"); // keep in mind that the arrays are of different length!

    StringBuilder uniquePart = new StringBuilder();   // target segments after the common prefix
    StringBuilder relativePart = new StringBuilder(); // one "../" per source directory after the common prefix
    boolean stillSame = true;

    for (int ii = 0; ii < sourceElements.length || ii < targetElements.length; ii++) {
        // skip the leading segments that both URLs share
        if (stillSame && ii < targetElements.length && ii < sourceElements.length
                && sourceElements[ii].equals(targetElements[ii])) continue;
        stillSame = false;
        if (targetElements.length > ii)
            uniquePart.append("/").append(targetElements[ii]);
        if (sourceElements.length > ii + 1) // the last source element is the file name, not a directory
            relativePart.append("../");
    }

    // relativePart ends with "/" and uniquePart starts with "/": drop one of the two.
    // An empty relativePart (both pages in the same directory) must not be passed to substring(0, -1).
    String result = relativePart.length() > 0
            ? relativePart.substring(0, relativePart.length() - 1) + uniquePart
            : (uniquePart.length() > 0 ? uniquePart.substring(1) : "");
    System.out.println("result: " + result);
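If you need this for every link on a page, the same segment comparison can be wrapped in a method. The following is only a minimal sketch, not a complete URL handler: the names `RelativeUrls` and `relativize` are made up, and it assumes that both URLs end in a file name and that the whole URL can be treated as plain "/"-separated segments. When scheme or host differ, it falls back to the absolute target.

```java
final class RelativeUrls {
    // Hypothetical helper: relative reference from the page at sourceUrl to targetUrl.
    static String relativize(String sourceUrl, String targetUrl) {
        String[] source = sourceUrl.split("/");
        String[] target = targetUrl.split("/");

        // Count the leading segments both URLs share, never counting
        // the last segment (the file name) of either URL.
        int limit = Math.min(source.length, target.length) - 1;
        int common = 0;
        while (common < limit && source[common].equals(target[common])) {
            common++;
        }
        // Segments 0..2 are "http:", "" and the host; if they differ, keep the absolute URL.
        if (common < 3) {
            return targetUrl;
        }

        StringBuilder result = new StringBuilder();
        // One "../" for every source directory below the common prefix.
        for (int i = common; i < source.length - 1; i++) {
            result.append("../");
        }
        // Then the target segments below the common prefix.
        for (int i = common; i < target.length; i++) {
            result.append(target[i]);
            if (i < target.length - 1) {
                result.append('/');
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // The example from the question; prints ../myanothersite/new.html
        System.out.println(relativize(
                "http://www.example.com/mysite/base.html",
                "http://www.example.com/myanothersite/new.html"));
    }
}
```

Two pages in the same directory simply yield the target file name, which is the case where the one-off snippet needs the extra check.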
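Answer 1 (score: 0)
This should do it. Keep in mind that you do not have to hard-code the baseUrl: you can compute it by checking how much of the source URL and the target URL is identical!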
    String baseUrl = "http://www.example.com/mysite/whatever/"; // the base of your site
    String sourceUrl = "http://www.example.com/mysite/whatever/somefolder/bar/unsecure!+?#whätyöühäv€it/site.html"; // your current site
    String targetUrl = "http://www.example.com/mysite/whatever/otherfolder/other.html"; // the link target
    String expectedTarget = "../../../otherfolder/other.html";

    // cut away the base
    if (sourceUrl.startsWith(baseUrl))
        sourceUrl = sourceUrl.substring(baseUrl.length());
    if (!sourceUrl.startsWith("/"))
        sourceUrl = "/" + sourceUrl;

    // construct the relative levels up: one "../" per directory left in the source path
    StringBuilder bar = new StringBuilder();
    while (sourceUrl.startsWith("/")) {
        if (sourceUrl.indexOf("/", 1) > 0) {
            bar.append("../");
            sourceUrl = sourceUrl.substring(sourceUrl.indexOf("/", 1));
        } else {
            break; // only the file name is left
        }
        System.out.println("foo: " + sourceUrl); // debug output: the remaining source path
    }

    // add the unique part of the target (this assumes targetUrl also starts with baseUrl)
    targetUrl = targetUrl.substring(baseUrl.length());
    bar.append(targetUrl);

    System.out.println("expectation: " + expectedTarget.equals(bar.toString()));
    System.out.println("bar: " + bar);
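As for computing the baseUrl instead of hard-coding it, one way to read "how much of the two URLs is identical" is: take the longest common prefix of both strings and cut it back to the last "/". The sketch below does just that. The names `CommonBase` and `commonBase` are made up, and it assumes both URLs share scheme and host.

```java
final class CommonBase {
    // Hypothetical helper: longest prefix shared by both URLs that ends with "/".
    static String commonBase(String sourceUrl, String targetUrl) {
        int max = Math.min(sourceUrl.length(), targetUrl.length());
        int lastSlash = -1;
        // Walk along both strings while they agree and remember the last "/" seen.
        for (int i = 0; i < max && sourceUrl.charAt(i) == targetUrl.charAt(i); i++) {
            if (sourceUrl.charAt(i) == '/') {
                lastSlash = i;
            }
        }
        // Cutting at the last shared "/" avoids splitting a directory name in half.
        return sourceUrl.substring(0, lastSlash + 1);
    }

    public static void main(String[] args) {
        // Prints http://www.example.com/mysite/whatever/
        System.out.println(commonBase(
                "http://www.example.com/mysite/whatever/somefolder/bar/site.html",
                "http://www.example.com/mysite/whatever/otherfolder/other.html"));
    }
}
```

The result can then be used as `baseUrl` in the code above.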