Here is the code from my MyCrawler.java. It crawls every link whose prefix I list in href.startsWith, but suppose I do not want to crawl one specific page, http://inv.somehost.com/people/index.html. How can I do that in my code?
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // 'filters' was not shown in the original snippet; a placeholder that
    // matches HTML content types is assumed here so the class compiles.
    private static final Pattern filters = Pattern.compile(".*text/x?html.*");

    public MyCrawler() {
    }

    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        if (href.startsWith("http://www.somehost.com/")
                || href.startsWith("http://inv.somehost.com/")
                || href.startsWith("http://jo.somehost.com/")) {
            // And if I do not want to crawl a page such as
            // http://inv.somehost.com/data/index.html, how can it be done?
            return true;
        }
        return false;
    }
    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String text = page.getText();
        List<WebURL> links = page.getURLs();
        int parentDocid = page.getWebURL().getParentDocid();
        try {
            URL url1 = new URL(url);
            System.out.println("URL:- " + url1);
            URLConnection connection = url1.openConnection();
            Map<String, List<String>> responseMap = connection.getHeaderFields();
            for (Map.Entry<String, List<String>> entry : responseMap.entrySet()) {
                // entry.toString() yields e.g. Content-Type=[text/html; charset=ISO-8859-1]
                String key = entry.toString();
                if (key.contains("text/html") || key.contains("text/xhtml")) {
                    System.out.println(key);
                    // Matcher is never null, so the original test
                    // 'filters.matcher(key) != null' was always true; use matches() instead.
                    if (filters.matcher(key).matches()) {
                        System.out.println(url1);
                        try {
                            // Create crawl_html/<md5-of-url>.txt if it does not exist yet
                            final File parentDir = new File("crawl_html");
                            parentDir.mkdir();
                            final String hash = MD5Util.md5Hex(url1.toString());
                            final String fileName = hash + ".txt";
                            final File file = new File(parentDir, fileName);
                            boolean success = file.createNewFile();
                            System.out.println("hash:-" + hash);
                            System.out.println(file);
                            FileOutputStream fos = new FileOutputStream(file, true);
                            PrintWriter out = new PrintWriter(fos);
                            // Extract the page text with Tika and append it to the file
                            Tika t = new Tika();
                            String content = t.parseToString(url1);
                            out.println("===============================================================");
                            out.println(url1);
                            out.println(key);
                            out.println(content);
                            out.println("===============================================================");
                            // Closing the PrintWriter flushes and closes the underlying stream
                            out.close();
                        } catch (FileNotFoundException e) {
                            e.printStackTrace();
                        } catch (IOException e) {
                            e.printStackTrace();
                        } catch (TikaException e) {
                            e.printStackTrace();
                        }
                    }
                }
            }
        } catch (MalformedURLException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println("=============");
    }
}
Here is my Controller.java code that invokes MyCrawler:
import edu.uci.ics.crawler4j.crawler.CrawlController;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");
        controller.addSeed("http://www.somehost.com/");
        controller.addSeed("http://inv.somehost.com/");
        controller.addSeed("http://jo.somehost.com/");
        // Configure the crawl before calling start(); start() blocks until the
        // crawl finishes, so settings applied afterwards never take effect.
        controller.setPolitenessDelay(200);
        controller.setMaximumCrawlDepth(2);
        controller.start(MyCrawler.class, 20);
    }
}
Any suggestions would be appreciated.
Answer 0 (score: 1)
How about adding a property that tells the crawler which URLs to exclude? Add every page you do not want crawled to that exclusion list.
Here is an example:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    List<Pattern> exclusionsPatterns;

    public MyCrawler() {
        exclusionsPatterns = new ArrayList<Pattern>();
        // Add all your exclusions here, as regular expressions
        exclusionsPatterns.add(Pattern.compile("http://investor\\.somehost\\.com.*"));
    }

    /*
     * You should implement this function to specify
     * whether the given URL should be visited or not.
     */
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        // Walk the patterns to see whether this URL is excluded
        for (Pattern exclusionPattern : exclusionsPatterns) {
            Matcher matcher = exclusionPattern.matcher(href);
            if (matcher.matches()) {
                return false;
            }
        }
        if (href.startsWith("http://www.ics.uci.edu/")) {
            return true;
        }
        return false;
    }
}
In this example, we tell the crawler that any URL starting with http://investor.somehost.com should not be crawled.
So these will not be crawled:
http://investor.somehost.com/index.html
http://investor.somehost.com/something/else
I suggest you read up on regular expressions.
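The pattern above excludes an entire subdomain. The original question asked about skipping one specific page, http://inv.somehost.com/people/index.html, which the same exclusion list handles as well; a minimal sketch, assuming it goes in the MyCrawler constructor shown above:

    // Exclude exactly one page: Pattern.quote treats the URL as a literal,
    // so the dots are not regex wildcards and only this exact URL is rejected.
    exclusionsPatterns.add(Pattern.compile(
            Pattern.quote("http://inv.somehost.com/people/index.html")));

    // Or exclude everything under /people/ on that host:
    exclusionsPatterns.add(Pattern.compile("http://inv\\.somehost\\.com/people/.*"));

Note that shouldVisit lowercases the URL before matching, so keep the patterns lowercase as well.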