So I'm working on a web crawler that should download all images, files, and web pages, and then recursively do the same for every web page it finds. However, I seem to have a logic error.
public class WebCrawler {

    private static String url;
    private static int maxCrawlDepth;
    private static String filePath;

    /* Recursive function that crawls all web pages found on a given web page.
     * This function also saves elements from the DownloadRepository to disk.
     */
    public static void crawling(WebPage webpage, int currentCrawlDepth, int maxCrawlDepth) {
        webpage.crawl(currentCrawlDepth);

        HashMap<String, WebPage> pages = webpage.getCrawledWebPages();

        if(currentCrawlDepth < maxCrawlDepth) {
            for(WebPage wp : pages.values()) {
                crawling(wp, currentCrawlDepth+1, maxCrawlDepth);
            }
        }
    }

    public static void main(String[] args) {
        if(args.length != 3) {
            System.out.println("Must pass three parameters");
            System.exit(0);
        }

        url = "";
        maxCrawlDepth = 0;
        filePath = "";

        url = args[0];

        try {
            URL testUrl = new URL(url);
            URLConnection urlConnection = testUrl.openConnection();
            urlConnection.connect();
        } catch (MalformedURLException e) {
            System.out.println("Not a valid URL");
            System.exit(0);
        } catch (IOException e) {
            System.out.println("Could not open URL");
            System.exit(0);
        }

        try {
            maxCrawlDepth = Integer.parseInt(args[1]);
        } catch (NumberFormatException e) {
            System.out.println("Argument is not an int");
            System.exit(0);
        }

        filePath = args[2];

        File path = new File(filePath);
        if(!path.exists()) {
            System.out.println("File Path is invalid");
            System.exit(0);
        }

        WebPage webpage = new WebPage(url);

        crawling(webpage, 0, maxCrawlDepth);

        System.out.println("Web crawl is complete");
    }
}
The crawl function parses the page content and stores any images, files, or links it finds in hash maps, for example:
public class WebPage implements WebElement {

    private static Elements images;
    private static Elements links;

    private HashMap<String, WebImage> webImages = new HashMap<String, WebImage>();
    private HashMap<String, WebPage> webPages = new HashMap<String, WebPage>();
    private HashMap<String, WebFile> files = new HashMap<String, WebFile>();

    private String url;

    public WebPage(String url) {
        this.url = url;
    }

    /* The crawl method parses the html on a given web page
     * and adds the elements of the web page to the Download
     * Repository.
     */
    public void crawl(int currentCrawlDepth) {
        System.out.print("Crawling " + url + " at crawl depth ");
        System.out.println(currentCrawlDepth + "\n");

        Document doc = null;
        try {
            HttpConnection httpConnection = (HttpConnection) Jsoup.connect(url);
            httpConnection.ignoreContentType(true);
            doc = httpConnection.get();
        } catch (MalformedURLException e) {
            System.out.println(e.getLocalizedMessage());
        } catch (IOException e) {
            System.out.println(e.getLocalizedMessage());
        } catch (IllegalArgumentException e) {
            System.out.println(url + " is not a valid URL");
        }

        DownloadRepository downloadRepository = DownloadRepository.getInstance();

        if(doc != null) {
            images = doc.select("img");
            links = doc.select("a[href]");

            for(Element image : images) {
                String imageUrl = image.absUrl("src");
                if(!webImages.containsValue(image)) {
                    WebImage webImage = new WebImage(imageUrl);
                    webImages.put(imageUrl, webImage);
                    downloadRepository.addElement(imageUrl, webImage);
                    System.out.println("Added image at " + imageUrl);
                }
            }

            HttpConnection mimeConnection = null;
            Response mimeResponse = null;

            for(Element link : links) {
                String linkUrl = link.absUrl("href");
                linkUrl = linkUrl.trim();

                if(!linkUrl.contains("#")) {
                    try {
                        mimeConnection = (HttpConnection) Jsoup.connect(linkUrl);
                        mimeConnection.ignoreContentType(true);
                        mimeConnection.ignoreHttpErrors(true);
                        mimeResponse = (Response) mimeConnection.execute();
                    } catch (Exception e) {
                        System.out.println(e.getLocalizedMessage());
                    }

                    String contentType = null;
                    if(mimeResponse != null) {
                        contentType = mimeResponse.contentType();
                    }

                    if(contentType == null) {
                        continue;
                    }

                    if(contentType.equals("text/html")) {
                        if(!webPages.containsKey(linkUrl)) {
                            WebPage webPage = new WebPage(linkUrl);
                            webPages.put(linkUrl, webPage);
                            downloadRepository.addElement(linkUrl, webPage);
                            System.out.println("Added webPage at " + linkUrl);
                        }
                    }
                    else {
                        if(!files.containsValue(link)) {
                            WebFile webFile = new WebFile(linkUrl);
                            files.put(linkUrl, webFile);
                            downloadRepository.addElement(linkUrl, webFile);
                            System.out.println("Added file at " + linkUrl);
                        }
                    }
                }
            }
        }

        System.out.print("\nFinished crawling " + url + " at crawl depth ");
        System.out.println(currentCrawlDepth + "\n");
    }

    public HashMap<String, WebImage> getImages() {
        return webImages;
    }

    public HashMap<String, WebPage> getCrawledWebPages() {
        return webPages;
    }

    public HashMap<String, WebFile> getFiles() {
        return files;
    }

    public String getUrl() {
        return url;
    }

    @Override
    public void saveToDisk(String filePath) {
        System.out.println(filePath);
    }
}
The purpose of the hash maps is to make sure I don't parse the same site more than once. The error seems to be related to my recursion. What is the problem?
Some sample output from the start of the crawl:
Crawling https://www.google.com/ at crawl depth 0
Added webPage at http://www.google.com/intl/en/options/
Added webPage at https://www.google.com/intl/en/ads/
Added webPage at https://www.google.com/services/
Added webPage at https://www.google.com/intl/en/about.html
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/ at crawl depth 0
Crawling https://www.google.com/services/ at crawl depth 1
Added webPage at http://www.google.com/intl/en/enterprise/apps/business/?utm_medium=et&utm_campaign=en&utm_source=us-en-et-nelson_bizsol
Added webPage at https://www.google.com/services/sitemap.html
Added webPage at https://www.google.com/intl/en/about/
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/services/ at crawl depth 1
**Crawling https://www.google.com/intl/en/policies/ at crawl depth 2**
Added webPage at https://www.google.com/intl/en/policies/
Added webPage at https://www.google.com/intl/en/policies/terms/
Added webPage at https://www.google.com/intl/en/policies/privacy/
Added webPage at https://www.google.com/intl/en/policies/terms/
Added webPage at https://www.google.com/intl/en/policies/faq/
Added webPage at https://www.google.com/intl/en/policies/technologies/
Added webPage at https://www.google.com/intl/en/about/
Added webPage at https://www.google.com/intl/en/policies/
Finished crawling https://www.google.com/intl/en/policies/ at crawl depth 2
**Crawling https://www.google.com/intl/en/policies/ at crawl depth 3**
Notice that it parses http://www.google.com/intl/en/policies/ twice.
Answer 0 (score: 1)
You are creating a new map for each web page. This ensures that if the same link appears twice on one page it is only crawled once, but it does not handle the case where the same link appears on two different pages.
https://www.google.com/intl/en/policies/ appears on both https://www.google.com/ and https://www.google.com/services/.
To avoid this, use a single map for the whole crawl and pass it into the recursion as a parameter:
public class WebCrawler {

    private static HashMap<String, WebPage> visited = new HashMap<String, WebPage>();

    public static void crawling(Map<String, WebPage> visited, WebPage webpage, int currentCrawlDepth, int maxCrawlDepth) {
    }
}
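A rough sketch of how that method body might use the shared map, assuming the page URL (via the getUrl() accessor shown above) is used as the key; this is an illustration, not the original code:

    // Sketch only: skip any page whose URL is already in the shared map,
    // register the page, then recurse with the same map.
    public static void crawling(Map<String, WebPage> visited, WebPage webpage,
                                int currentCrawlDepth, int maxCrawlDepth) {
        if (visited.containsKey(webpage.getUrl())) {
            return; // already crawled via some other page
        }
        visited.put(webpage.getUrl(), webpage);

        webpage.crawl(currentCrawlDepth);

        if (currentCrawlDepth < maxCrawlDepth) {
            for (WebPage wp : webpage.getCrawledWebPages().values()) {
                crawling(visited, wp, currentCrawlDepth + 1, maxCrawlDepth);
            }
        }
    }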
Since you are also holding maps for images and so on, you may prefer to create a new object, perhaps called Visited, and have it keep track of everything:
public class Visited {

    private HashMap<String, WebPage> webPages = new HashMap<String, WebPage>();
    private HashMap<String, WebImage> webImages = new HashMap<String, WebImage>();

    // Returns true only the first time a page URL is seen.
    public boolean visit(String url, WebPage page) {
        if (webPages.containsKey(url)) {
            return false;
        }
        webPages.put(url, page);
        return true;
    }

    // Returns true only the first time an image URL is seen.
    public boolean visit(String url, WebImage image) {
        if (webImages.containsKey(url)) {
            return false;
        }
        webImages.put(url, image);
        return true;
    }
}
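With something like that in place, the containsValue checks inside crawl could be replaced by a single call to visit(), which only returns true the first time a URL is seen. A hypothetical snippet, assuming a shared Visited instance is made available to crawl (which the original code does not yet do):

    // Hypothetical use inside WebPage.crawl(): only record and download
    // the image if this URL has not been visited before.
    WebImage webImage = new WebImage(imageUrl);
    if (visited.visit(imageUrl, webImage)) {
        downloadRepository.addElement(imageUrl, webImage);
        System.out.println("Added image at " + imageUrl);
    }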