So here is my dilemma: I am trying to build a web scraper that collects the PDF links from an entire website, but as the code below shows, I can only scrape one specific page rather than the whole site. What I want my code to do is scrape the initial URL for PDF links (which it already does) and then search the rest of the site for more PDF links. Can someone tell me what I am doing wrong, or what I need to add? I would really appreciate it.
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Crawler {

    /**
     * @param args the command line arguments
     * @throws java.io.IOException
     */
    public static void main(String[] args) throws IOException {
        String url = "http://www.tuskegee.edu";
        print("Fetching %s...", url);

        Document doc = Jsoup.connect(url).timeout(0).get();
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");
        Elements links1 = doc.select("a[href]");

        print("\nMedia: (%d)", media.size());
        for (Element src : media) {
            if (src.tagName().equals("img"))
                print(" * %s: <%s> %sx%s (%s)",
                        src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
                        trim(src.attr("alt"), 20));
            else
                print(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
        }

        print("\nImports: (%d)", imports.size());
        for (Element link : imports) {
            print(" * %s <%s> (%s)", link.tagName(), link.attr("abs:href"), link.attr("rel"));
        }

        print("\nLinks: (%d)", links1.size());
        for (Element link : links1) {
            print(" * a: <%s> (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        else
            return s;
    }
}
Answer 0 (score: 0)
This should do it:
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

import org.jsoup.HttpStatusException;
import org.jsoup.Jsoup;
import org.jsoup.UnsupportedMimeTypeException;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Crawler {

    /**
     * @param args the command line arguments
     * @throws java.io.IOException
     */
    public static void main(String[] args) throws IOException {
        Set<String> visitedUrls = new HashSet<>();
        String url = "http://www.tuskegee.edu";
        crawl(url, visitedUrls);
    }

    private static void crawl(String url, Set<String> visited) throws IOException {
        // Skip empty URLs and pages that have already been crawled.
        if (url.isEmpty() || visited.contains(url)) {
            return;
        }
        print("Fetching %s...", url);
        visited.add(url);

        Document doc;
        try {
            doc = Jsoup.connect(url).timeout(10000).get();
        } catch (UnsupportedMimeTypeException e) {
            System.out.println("Unsupported Mime type. Aborting crawling for URL: " + url);
            return;
        } catch (MalformedURLException e) {
            System.out.println("Unsupported protocol for URL: " + url);
            return;
        } catch (HttpStatusException e) {
            System.out.println("Error (status=" + e.getStatusCode() + ") fetching URL: " + url);
            return;
        } catch (IOException e) {
            System.out.println("Timeout or I/O error fetching URL: " + url);
            return;
        }

        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");
        Elements links1 = doc.select("a[href]");

        print("\nMedia: (%d)", media.size());
        for (Element src : media) {
            if (src.tagName().equals("img"))
                print(" * %s: <%s> %sx%s (%s)",
                        src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
                        trim(src.attr("alt"), 20));
            else
                print(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
        }

        print("\nImports: (%d)", imports.size());
        for (Element link : imports) {
            print(" * %s <%s> (%s)", link.tagName(), link.attr("abs:href"), link.attr("rel"));
        }

        print("\nLinks: (%d)", links1.size());
        for (Element link : links1) {
            print(" * a: <%s> (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }

        // Recurse into every link that stays on the same host as the current page.
        for (Element link : links1) {
            String href = link.attr("abs:href");
            URL hrefURL = null;
            try {
                hrefURL = new URL(href);
            } catch (MalformedURLException e) {
                // Not a crawlable URL (e.g. mailto: or javascript:), so skip it.
            }
            if (hrefURL != null && hrefURL.getHost().equals(new URL(url).getHost())) {
                crawl(href, visited);
            }
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width - 1) + ".";
        else
            return s;
    }
}
Note the final for loop that was added to the original code.
EDIT: Added tracking of visited URLs to avoid an infinite loop when an already-visited URL is encountered.
EDIT2: Added some error handling and a same-domain restriction, because you have to be careful here or you could end up crawling the entire Internet!
You still need to extract the content you actually want and save it somewhere.
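For example, here is a minimal sketch of that last step, assuming you only care about links whose URL ends in .pdf. The helper names and the output file name are my own choices, not part of the answer above; the methods are meant to sit inside the Crawler class, with a Set<String> pdfLinks passed into crawl() alongside visited:

    // Hypothetical helper: collect links whose absolute URL ends in ".pdf" (case-insensitive).
    private static void collectPdfLinks(Document doc, Set<String> pdfLinks) {
        for (Element link : doc.select("a[href]")) {
            String href = link.attr("abs:href");
            if (href.toLowerCase().endsWith(".pdf")) {
                pdfLinks.add(href);
            }
        }
    }

    // Hypothetical helper: write the collected PDF URLs to a text file, one per line.
    private static void savePdfLinks(Set<String> pdfLinks, String fileName) throws IOException {
        java.nio.file.Files.write(java.nio.file.Paths.get(fileName), pdfLinks);
    }

You would call collectPdfLinks(doc, pdfLinks) after each successful fetch inside crawl(), and then call something like savePdfLinks(pdfLinks, "pdf-links.txt") from main once crawl() returns.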