I parse a given URL and try to save all inner links (same domain) in the allInnerLinks ArrayList and all external links in the allExternalLinks ArrayList.
public void go() {
    Document doc;
    baseUrl = CountLinks.result3;
    try {
        // need http protocol
        doc = Jsoup
                .connect(url)
                .userAgent(
                        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                .referrer("http://www.google.com").timeout(1000 * 5)
                .ignoreContentType(true).get();
        // get page title
        String title = doc.title();
        // get all links
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            // !!!
            // String absUrl = link.absUrl("href");
            String absUrl = link.attr("abs:href");
            // get the value from href attribute
            if (absUrl.contains(baseUrl)
                    && !(absUrl.contains("mailto"))) {
                allInnerLinks.add(absUrl);
                allInnerLinksCounter++;
            } else {
                allExternalLinks.add(absUrl);
                allExternalLinksCounter++;
            }
        }
    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
        System.out.println(e.getUrl());
    } catch (IOException e) {
        e.printStackTrace();
    }
}
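But in the end I get duplicate elements. The same URL shows up a second time with a number sign (#) at the end of the link. I can't understand how I end up with this: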
PAGEURL                                   EXTERNAL URLS
----------------------------------------------------------------------------------------
http://hostingmaks.com/category/news/     https://meetings.webex.com/
http://hostingmaks.com/category/news/#    https://meetings.webex.com/
What is the reason for this?
Answer 0 (score: 0)
I wrote a simple method below that checks whether a URL has a trailing hash / pound / number sign (#) and returns a boolean.
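import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Returns true when the URL contains '#' and the text after the last '#'
// is empty or made up only of letters, digits and spaces.
public boolean hasHashTag(String url) {
    int index = url.lastIndexOf("#");
    if (index == -1) {
        // no '#' anywhere in the URL
        return false;
    } else {
        // matches any character that is not a letter, a digit or a space
        Pattern p = Pattern.compile("[^a-z0-9 ]", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(url.substring(index + 1));
        // debug output: the text after '#' and the index where it starts
        System.out.println(url.substring(index + 1) + " " + (index + 1));
        return !m.find();
    }
}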
You can now use this method to filter out the duplicates.
if (hasHashTag(URLHERE)) {
    // don't add to urls to search
} else {
    // add url to search
}
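As for why this happens: the most likely cause is an anchor such as <a href="#"> on the page. When "#" is resolved against the page URL, the result is the same address with a trailing #, so the page appears twice. An alternative to skipping such links is to strip the fragment before storing the URL and keep the results in a set. The following is only a minimal sketch of that idea; the class name FragmentStripper and the helpers stripFragment and uniqueWithoutFragments are hypothetical and not part of the question's code or of Jsoup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, not part of the original code.
class FragmentStripper {

    // Returns the URL without its "#fragment" part, or the URL unchanged
    // when it contains no '#'.
    static String stripFragment(String url) {
        int index = url.indexOf('#');
        return index == -1 ? url : url.substring(0, index);
    }

    // Strips fragments and drops duplicates while keeping the original order.
    static List<String> uniqueWithoutFragments(List<String> urls) {
        Set<String> seen = new LinkedHashSet<>();
        for (String url : urls) {
            seen.add(stripFragment(url));
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList(
                "http://hostingmaks.com/category/news/",
                "http://hostingmaks.com/category/news/#");
        // prints [http://hostingmaks.com/category/news/]
        System.out.println(uniqueWithoutFragments(pages));
    }
}
```

In the loop from the question this would presumably mean calling stripFragment(absUrl) before adding to allInnerLinks or allExternalLinks. Unlike hasHashTag, this also covers fragments that contain punctuation, such as #section-1.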