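I'm new to algorithms and data structures. I need to implement a web crawler using BFS. This is what I have so far, but because I'm using a queue, I can't keep track of the depth.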
public void BFS() {
    String link = "";
    while (mainSet.size() <= 100 && depth < 5) {
        if (queue.size() >= 1) {
            System.out.println(queue);
            link = queue.removeFirst();
            System.out.println("Link shifted from queue!");
            System.out.println(link);
            String html = fetchContent(link);
            fetchLinks(html);
        } else {
            System.out.println("Completed!!");
            break;
        }
    }
}
public String fetchContent(String strLink) {
    String html = "";
    URLConnection connection = null;
    Scanner scanner = null;
    try {
        connection = new URL(strLink).openConnection();
        scanner = new Scanner(connection.getInputStream());
        scanner.useDelimiter("\\Z");
        if (scanner.hasNext()) {
            html = scanner.next();
            visited.add(strLink);
        }
    } catch (Exception ex) {
    } finally {
        if (scanner != null)
            scanner.close();
    }
    return html;
}
public void fetchLinks(String html) {
    Document doc = Jsoup.parse(html);
    Elements links = doc.select("a[href]");
    for (Element link : links) {
        String group = link.attr("href");
        if ((!group.contains(".css")) && (!group.contains(".ico")) && (!group.contains(".jpg")) && (!group.contains(" "))
                && (!group.contains(".gif")) && (!group.contains(".pdf")) && (!group.contains(".zip")) && (!group.contains(".asc"))
                && (!group.contains(".rar")) && (!group.contains(".png")) && (!group.contains(".7z"))
                && (!group.contains(".djvu")) && (!group.contains(".chm")) && (!group.contains(".mp3"))
                && (!group.contains(".ogg")) && (!group.contains(".rm")) && (!group.contains(".wav"))
                && (!group.contains("mailto:")) && (!group.contains("#")) && (!group.contains(".xml"))
                && (!group.contains(".js")) && (!group.contains("news:")) && (!group.contains("mail:"))
                && (!group.contains(".txt")) && (!group.contains(".bz2")) && (!group.contains(".gz"))
                && (!group.contains("javascript:")) && (!group.contains("exe")) && (!group.contains("vbs"))) {
            group = group.replaceAll("'", "");
            group = group.replaceAll("\"", "");
            if ((group.indexOf("http") == -1)) {
                if (group.charAt(0) != '/') {
                    group = parent + group;
                } else if (group.charAt(0) == '/') {
                    group = scheme + "://" + authority + group;
                }
                System.out.println("RelLink: " + group);
                mainSet.add(group);
            } else if (group.startsWith(parent)) {
                System.out.println("SeedLink: " + group);
                mainSet.add(group);
            }
            if (!visited.contains(group)) {
                if (group.startsWith(parent)) {
                    queue.add(group);
                }
            }
        }
    }
}
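I want to limit the crawler by depth. I'd also like to know how to remove duplicates from the queue.
Answer 0 (score: 1)
To limit the depth, you can create a class that encapsulates the page to fetch together with its depth. You can even move some of the functions into that class: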
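import java.util.Queue;
import java.util.Set;

public class Page {
    private static final int MAX_DEPTH = 5;

    private final int depth;
    private final String url;

    public Page(String url, int depth) {
        this.url = url;
        this.depth = depth;
    }

    private String fetchContent(String strLink) {
        // placeholder: use your implementation from the question
        throw new UnsupportedOperationException("not implemented");
    }

    private Set<String> fetchLinks(String html) {
        // use your implementation, but return the links instead
        // of adding them to a queue. Using a set removes duplicates
        throw new UnsupportedOperationException("not implemented");
    }

    /**
     * Fetches the URL represented by this page, and
     * adds pages to the queue for all pages linked to
     * by the page.
     */
    public void visitPage(Queue<Page> workQueue) {
        String html = fetchContent(url);
        if (depth >= MAX_DEPTH) {
            // in too deep! the page is fetched, but its links are not followed
            return;
        }
        for (String link : fetchLinks(html)) {
            workQueue.add(new Page(link, depth + 1));
        }
    }
}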
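To show how a queue of depth-tagged pages is drained, here is a minimal sketch of the driver loop. It is not the asker's crawler: the class name `DepthLimitedCrawl`, the `crawl` method, and the in-memory `linkGraph` map (a stand-in for `fetchContent` plus `fetchLinks`, so the example runs without network access) are all assumptions made for illustration. Duplicates are deliberately not handled here.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class DepthLimitedCrawl {
    /** Pairs a URL with the BFS level at which it was discovered. */
    static final class Page {
        final String url;
        final int depth;

        Page(String url, int depth) {
            this.url = url;
            this.depth = depth;
        }
    }

    /**
     * Visits pages breadth-first starting at seed. Pages at maxDepth are
     * visited, but their links are not followed. linkGraph is a hypothetical
     * stand-in for fetching a page and extracting its links.
     */
    static List<String> crawl(String seed, int maxDepth, Map<String, List<String>> linkGraph) {
        List<String> visitOrder = new ArrayList<>();
        Queue<Page> workQueue = new ArrayDeque<>();
        workQueue.add(new Page(seed, 0)); // the seed is at depth 0
        while (!workQueue.isEmpty()) {
            Page page = workQueue.remove();
            visitOrder.add(page.url + "@" + page.depth);
            if (page.depth == maxDepth) {
                continue; // depth limit reached: do not enqueue children
            }
            for (String link : linkGraph.getOrDefault(page.url, Collections.<String>emptyList())) {
                // each child is one level deeper than the page that links to it
                workQueue.add(new Page(link, page.depth + 1));
            }
        }
        return visitOrder;
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = new HashMap<>();
        graph.put("a", Arrays.asList("b", "c"));
        graph.put("b", Arrays.asList("d"));
        graph.put("d", Arrays.asList("e"));
        System.out.println(crawl("a", 2, graph)); // prints [a@0, b@1, c@1, d@2]
    }
}
```

Because the depth travels with each queue entry, the loop needs no global depth counter, and `e` is never reached since `d` already sits at the limit.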
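As for removing duplicates, you can either use a LinkedHashSet instead of a Queue (to prevent duplicates in the queue), or maintain a Set of pages that have already been fetched (to prevent fetching the same page more than once).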
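The two options can be combined, as in the rough sketch below. A `LinkedHashSet` keeps insertion order and ignores an element that is already present, so it can act as a FIFO work list without duplicates. It only knows about URLs that are still pending, though, so a separate set of fetched URLs is still needed to avoid re-adding a page that has already been processed. The names `DedupedCrawl`, `crawl`, and the in-memory `linkGraph` (again a stand-in for the real fetching and link extraction) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class DedupedCrawl {
    /**
     * Breadth-first traversal that fetches every reachable URL exactly once.
     * linkGraph is a hypothetical stand-in for fetching a page and
     * extracting its links.
     */
    static List<String> crawl(String seed, Map<String, List<String>> linkGraph) {
        List<String> fetchOrder = new ArrayList<>();
        Set<String> fetched = new HashSet<>();                  // pages already processed
        LinkedHashSet<String> pending = new LinkedHashSet<>();  // FIFO work list, no duplicates
        pending.add(seed);
        while (!pending.isEmpty()) {
            // take the oldest pending URL (insertion order) and remove it
            Iterator<String> it = pending.iterator();
            String url = it.next();
            it.remove();

            fetched.add(url);
            fetchOrder.add(url);
            for (String link : linkGraph.getOrDefault(url, Collections.<String>emptyList())) {
                if (!fetched.contains(link)) {
                    pending.add(link); // no effect if the link is already pending
                }
            }
        }
        return fetchOrder;
    }
}
```

With a graph where `a` links to `b`, `c`, `b`, then `b` links to `a`, `c`, `d`, and both `c` and `d` link back to earlier pages, `crawl("a", graph)` returns each of the four URLs once, in breadth-first order.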