Don't crawl certain pages under specific links (exclude some URLs from the crawl)

Date: 2011-07-13 17:49:18

Tags: java web-crawler

This is the code in my MyCrawler.java. It crawls all the links whose prefixes I pass to href.startsWith, but suppose I do not want to crawl one particular page, http://inv.somehost.com/people/index.html — how can I do that in my code?

public MyCrawler() {
}

public boolean shouldVisit(WebURL url) {
    String href = url.getURL().toLowerCase();

    if (href.startsWith("http://www.somehost.com/") || href.startsWith("http://inv.somehost.com/") || href.startsWith("http://jo.somehost.com/")) {
        // And if I do not want to crawl this page, http://inv.somehost.com/data/index.html, how can that be done?
        return true;
    }
    return false;
}


public void visit(Page page) {

    int docid = page.getWebURL().getDocid();
    String url = page.getWebURL().getURL();
    String text = page.getText();
    List<WebURL> links = page.getURLs();
    int parentDocid = page.getWebURL().getParentDocid();

    try {
        URL url1 = new URL(url);
        System.out.println("URL:- " + url1);
        URLConnection connection = url1.openConnection();

        Map<String, List<String>> responseMap = connection.getHeaderFields();
        for (Map.Entry<String, List<String>> entry : responseMap.entrySet()) {
            String key = entry.toString();

            if (key.contains("text/html") || key.contains("text/xhtml")) {
                System.out.println(key);
                // e.g. Content-Type=[text/html; charset=ISO-8859-1]
                // Note: Matcher is never null, so the original check (filters.matcher(key) != null)
                // was always true; use find() to actually test the pattern against the header.
                if (filters.matcher(key).find()) {
                    System.out.println(url1);
                    try {
                        // Create the output directory and a file named after the MD5 hash of the URL
                        final File parentDir = new File("crawl_html");
                        parentDir.mkdir();
                        final String hash = MD5Util.md5Hex(url1.toString());
                        final String fileName = hash + ".txt";
                        final File file = new File(parentDir, fileName);
                        boolean success = file.createNewFile(); // creates crawl_html/<hash>.txt if it does not exist

                        System.out.println("hash:-" + hash);
                        System.out.println(file);

                        FileOutputStream fos = new FileOutputStream(file, true);
                        PrintWriter out = new PrintWriter(fos);

                        // Extract the page text with Tika and append it to the file
                        Tika t = new Tika();
                        String content = t.parseToString(new URL(url1.toString()));

                        out.println("===============================================================");
                        out.println(url1);
                        out.println(key);
                        out.println(content);
                        out.println("===============================================================");
                        out.close(); // closing the PrintWriter also flushes and closes fos
                    } catch (FileNotFoundException e) {
                        e.printStackTrace();
                    } catch (IOException e) {
                        e.printStackTrace();
                    } catch (TikaException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    } catch (MalformedURLException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }

    System.out.println("=============");
}
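The most direct thing I can think of is hard-coding that one page inside shouldVisit before the startsWith checks — a rough sketch of what I mean (I am not sure this is the right approach, especially if the list of excluded pages grows):

public boolean shouldVisit(WebURL url) {
    String href = url.getURL().toLowerCase();

    // Rough sketch: skip the one page I never want to crawl (hard-coded for illustration)
    if (href.equals("http://inv.somehost.com/people/index.html")) {
        return false;
    }

    if (href.startsWith("http://www.somehost.com/") || href.startsWith("http://inv.somehost.com/") || href.startsWith("http://jo.somehost.com/")) {
        return true;
    }
    return false;
}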

This is my Controller.java code that calls MyCrawler:

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");
        controller.addSeed("http://www.somehost.com/");
        controller.addSeed("http://inv.somehost.com/");
        controller.addSeed("http://jo.somehost.com/");
        // Configure the crawl before start(), which blocks until the crawl finishes
        controller.setPolitenessDelay(200);
        controller.setMaximumCrawlDepth(2);
        controller.start(MyCrawler.class, 20);
    }
}

Any suggestions would be appreciated.

1 answer:

Answer 0 (score: 1)

How about adding a property that tells the crawler which URLs should be excluded?

Add every page you do not want crawled to your exclusion list.

Here is an example:

public class MyCrawler extends WebCrawler {

    List<Pattern> exclusionsPatterns;

    public MyCrawler() {
        exclusionsPatterns = new ArrayList<Pattern>();
        // Add here all your exclusions using regular expressions
        exclusionsPatterns.add(Pattern.compile("http://investor\\.somehost\\.com.*"));
    }

    /*
     * You should implement this function to specify
     * whether the given URL should be visited or not.
     */
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();

        // Iterate the patterns to find out whether the URL is excluded.
        for (Pattern exclusionPattern : exclusionsPatterns) {
            Matcher matcher = exclusionPattern.matcher(href);
            if (matcher.matches()) {
                return false;
            }
        }

        if (href.startsWith("http://www.ics.uci.edu/")) {
            return true;
        }
        return false;
    }
}

In this example, we are saying that no URL starting with http://investor.somehost.com should be crawled.

So these will not be crawled:

http://investor.somehost.com/index.html
http://investor.somehost.com/something/else
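
If it helps, here is a small standalone snippet (not part of the crawler; the class name is just for illustration) to sanity-check the pattern against those URLs:

import java.util.regex.Pattern;

public class ExclusionPatternTest {
    public static void main(String[] args) {
        Pattern exclusion = Pattern.compile("http://investor\\.somehost\\.com.*");
        String[] urls = {
            "http://investor.somehost.com/index.html",
            "http://investor.somehost.com/something/else",
            "http://www.somehost.com/index.html"
        };
        for (String url : urls) {
            // matches() requires the whole URL to match the pattern
            System.out.println(url + " excluded? " + exclusion.matcher(url).matches());
        }
    }
}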

I suggest you read up on regular expressions.
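
And if you would rather not hard-code the patterns, they could be loaded from a properties file instead — a rough sketch, where the file name exclusions.properties and the key excluded.patterns are made-up placeholders:

import java.io.FileInputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;

public class ExclusionConfig {

    // Loads comma-separated regex patterns from a properties file, e.g.
    // excluded.patterns=http://investor\\.somehost\\.com.*,http://inv\\.somehost\\.com/people/.*
    public static List<Pattern> loadExclusions(String fileName) throws IOException {
        Properties props = new Properties();
        FileInputStream in = new FileInputStream(fileName);
        try {
            props.load(in);
        } finally {
            in.close();
        }

        List<Pattern> patterns = new ArrayList<Pattern>();
        String raw = props.getProperty("excluded.patterns", "");
        for (String p : raw.split(",")) {
            if (p.trim().length() > 0) {
                patterns.add(Pattern.compile(p.trim()));
            }
        }
        return patterns;
    }
}

The MyCrawler constructor could then call ExclusionConfig.loadExclusions("exclusions.properties") instead of building the list by hand.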