I have written a simple web crawler that recursively fetches URL links from a web page.
Now I am trying to find a way to limit the crawler by depth, but I do not know how to restrict it to a specific depth (I can limit the crawler to the first N links, but I want to limit it by depth).
For example, depth 2 should fetch the parent's links -> the children's links -> the children's children's links.
Any input is appreciated.
import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SimpleCrawler {

    // all URLs that have already been processed
    static Map<String, String> retMap = new ConcurrentHashMap<String, String>();

    public static void main(String[] args) throws IOException {
        StringBuffer sb = new StringBuffer();
        Map<String, String> map = returnURL("http://www.google.com");
        recursiveCrawl(map);
        for (Map.Entry<String, String> entry : retMap.entrySet()) {
            sb.append(entry.getKey());
        }
    }

    public static void recursiveCrawl(Map<String, String> map)
            throws IOException {
        for (Map.Entry<String, String> entry : map.entrySet()) {
            String key = entry.getKey();
            Map<String, String> recurSive = returnURL(key);
            recursiveCrawl(recurSive);
        }
    }

    // fetches a page and returns the links found on it
    public synchronized static Map<String, String> returnURL(String URL)
            throws IOException {
        Map<String, String> tempMap = new HashMap<String, String>();
        Document doc = null;
        if (URL != null && !URL.equals("") && !retMap.containsKey(URL)) {
            System.out.println("Processing==>" + URL);
            try {
                URL url = new URL(URL);
                System.setProperty("http.proxyHost", "proxy");
                System.setProperty("http.proxyPort", "port");
                doc = Jsoup.connect(URL).get();
                if (doc != null) {
                    Elements links = doc.select("a");
                    String finalString = "";
                    for (Element e : links) {
                        finalString = "http:" + e.attr("href");
                        if (!retMap.containsKey(finalString)) {
                            tempMap.put(finalString, finalString);
                        }
                    }
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
            retMap.put(URL, URL);
        } else {
            System.out.println("****Skipping URL****" + URL);
        }
        return tempMap;
    }
}
EDIT1:
I thought about using a worklist, so I modified the code. I am still not sure how to set the depth here (I can set the number of web pages to crawl, but not exactly the depth). Any suggestions would be appreciated.
public void startCrawl(String url) {
    while (this.pagesVisited.size() < 2) {
        String currentUrl;
        SpiderLeg leg = new SpiderLeg();
        if (this.pagesToVisit.isEmpty()) {
            currentUrl = url;
            this.pagesVisited.add(url);
        } else {
            currentUrl = this.nextUrl();
        }
        leg.crawl(currentUrl);
        System.out.println("pagesToVisit Size" + pagesToVisit.size());
        // SpiderLeg
        this.pagesToVisit.addAll(leg.getLinks());
    }
    System.out.println("\n**Done** Visited " + this.pagesVisited.size()
            + " web page(s)");
}
Answer 0 (score: 2)
A non-recursive approach: keep a worklist pagesToCrawl of CrawlURL entries:
class CrawlURL {
    public String url;
    public int depth;

    public CrawlURL(String url, int depth) {
        this.url = url;
        this.depth = depth;
    }
}
Initially (before entering the loop):
Queue<CrawlURL> pagesToCrawl = new LinkedList<>();
pagesToCrawl.add(new CrawlURL(rootUrl, 0)); //rootUrl is the url to start from
Now the loop:
while (!pagesToCrawl.isEmpty()) { // will proceed at least once (for rootUrl)
    CrawlURL currentUrl = pagesToCrawl.remove();
    // analyze the url
    // update the worklist with the crawled links
}
And the update with the crawled links:
if (currentUrl.depth < 2) {
    for (String url : leg.getLinks()) { // referring to your analysis result
        pagesToCrawl.add(new CrawlURL(url, currentUrl.depth + 1));
    }
}
You can enhance CrawlURL with additional metadata, e.g. link name, referrer, etc.
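Putting those snippets together, a minimal sketch of the whole loop could look like the following (CrawlURL is the class from above; the visited set, the MAX_DEPTH constant and the fetchLinks helper are assumptions here, with fetchLinks standing in for whatever extracts the links of a page, e.g. your Jsoup-based returnURL or SpiderLeg.getLinks):

import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class DepthLimitedCrawler {

    static final int MAX_DEPTH = 2; // parent -> children -> children's children

    public static void crawl(String rootUrl) {
        Queue<CrawlURL> pagesToCrawl = new LinkedList<CrawlURL>();
        Set<String> visited = new HashSet<String>(); // avoid fetching a page twice
        pagesToCrawl.add(new CrawlURL(rootUrl, 0));

        while (!pagesToCrawl.isEmpty()) {
            CrawlURL current = pagesToCrawl.remove();
            if (!visited.add(current.url)) {
                continue; // already processed
            }
            List<String> links = fetchLinks(current.url); // placeholder for your page analysis
            if (current.depth < MAX_DEPTH) {
                for (String link : links) {
                    pagesToCrawl.add(new CrawlURL(link, current.depth + 1));
                }
            }
        }
    }

    // placeholder: plug in your Jsoup code (or SpiderLeg) here
    private static List<String> fetchLinks(String url) {
        return new LinkedList<String>();
    }
}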
Alternative:
In my comment above I mentioned a generation-based approach. It is a bit more complex than this one. The basic idea is to keep two lists (currentPagesToCrawl and futurePagesToCrawl) together with a generation variable (starting at 0 and incremented every time currentPagesToCrawl becomes empty). All crawled URLs are put into the futurePagesToCrawl queue, and the two lists are switched whenever currentPagesToCrawl is empty. This is done until the generation variable reaches 2.
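A rough sketch of that alternative, under the assumption that fetchLinks again stands in for your link extraction (e.g. the Jsoup code from the question) and that links are no longer expanded once generation 2 is reached:

// generation-based sketch: two queues plus a generation counter
public static void generationCrawl(String rootUrl) {
    Queue<String> currentPagesToCrawl = new LinkedList<String>();
    Queue<String> futurePagesToCrawl = new LinkedList<String>();
    int generation = 0;

    currentPagesToCrawl.add(rootUrl);
    while (!currentPagesToCrawl.isEmpty()) {
        String url = currentPagesToCrawl.remove();
        futurePagesToCrawl.addAll(fetchLinks(url)); // all crawled links go to the future queue

        if (currentPagesToCrawl.isEmpty() && generation < 2) {
            // current generation exhausted: switch the two lists
            Queue<String> tmp = currentPagesToCrawl;
            currentPagesToCrawl = futurePagesToCrawl;
            futurePagesToCrawl = tmp; // now empty, collects the next generation
            generation++;
        }
    }
}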
Answer 1 (score: 1)
You can add a depth parameter to the signature of the recursive method, e.g. in your main:
recursiveCrawl(map, 0);
and:
public static void recursiveCrawl(Map<String, String> map, int depth) throws IOException {
    if (depth++ < DESIRED_DEPTH) // assuming initial depth = 0
        for (Map.Entry<String, String> entry : map.entrySet()) {
            String key = entry.getKey();
            Map<String, String> recurSive = returnURL(key);
            recursiveCrawl(recurSive, depth);
        }
}
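Here DESIRED_DEPTH is assumed to be a constant you define yourself, for example:

private static final int DESIRED_DEPTH = 2; // parent -> children -> children's children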
Answer 2 (score: 0)
You can do it like this:
static int maxLevels = 10;

public static void main(String args[]) throws IOException {
    ...
    recursiveCrawl(map, 0);
    ...
}

public static void recursiveCrawl(Map<String, String> map, int level) throws IOException {
    for (Map.Entry<String, String> entry : map.entrySet()) {
        String key = entry.getKey();
        Map<String, String> recurSive = returnURL(key);
        if (level < maxLevels) {
            recursiveCrawl(recurSive, level + 1); // level + 1 (not ++level) so siblings stay on the same level
        }
    }
}
Also, you could use a Set instead of a Map.
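For example, the visited bookkeeping could be a sketch like this (Collections.newSetFromMap backed by a ConcurrentHashMap keeps the Set thread-safe; imports needed: java.util.Collections, java.util.Set, java.util.concurrent.ConcurrentHashMap):

// thread-safe Set replacing the Map with identical key and value
static Set<String> visitedUrls =
        Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());

public static void visit(String url) {
    // add() returns false if the URL was already present, so it replaces
    // the containsKey()/put() pair in returnURL()
    if (visitedUrls.add(url)) {
        System.out.println("Processing==>" + url);
        // ... fetch the page with Jsoup and collect its links, as before
    }
}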