I have been working on a web crawler for a while now. The idea is simple: I have an SQL table containing a list of websites, and I have many threads that each fetch the first website from the table, delete it, and then crawl it (in a similar fashion).
The code is a bit too long, so I will try to cut some parts out of it:
while (true) {
    if (!stopped) {
        System.gc();
        Statement stmt;
        String scanned = "scanned";
        if (!scan) scanned = "crawled";
        Connection connection = null;
        try {
            connection = Utils.getConnection();
        } catch (Exception e1) {
            // connection is still null here, so calling connection.close()
            // in this catch block would throw a NullPointerException
            e1.printStackTrace();
        }
        String name;
        stmt = connection.createStatement();
        ResultSet rs = null;
        boolean next;
        do {
            rs = stmt.executeQuery("select url from websites where " + scanned + " = -1");
            next = rs.next();
        } while (next && Utils.inBlackList(rs.getString(1)));
        if (next) {
            name = rs.getString(1);
            stmt.executeUpdate("UPDATE websites SET " + scanned + " = 1 where url = '"
                    + Utils.stripDomainName(name) + "'");
            String backup_name = name;
            name = Utils.checkUrl(name);
            System.out.println(scanned + " of the website : " + name
                    + " just started by the Thread : " + num);
            // And here is the important part, I think
            CrawlConfig config = new CrawlConfig();
            String ts = Utils.getTime();
            SecureRandom random = new SecureRandom();
            String SessionId = new BigInteger(130, random).toString(32);
            String crawlStorageFolder = "tmp/temp_storageadmin" + SessionId;
            config.setCrawlStorageFolder(crawlStorageFolder);
            config.setPolitenessDelay(Main.POLITENESS_DELAY);
            config.setMaxDepthOfCrawling(Main.MAX_DEPTH_OF_CRAWLING);
            config.setMaxPagesToFetch(Main.MAX_PAGES_TO_FETCH);
            config.setResumableCrawling(Main.RESUMABLE_CRAWLING);
            int numberOfCrawlers = Main.NUMBER_OF_CRAWLERS;
            PageFetcher pageFetcher = new PageFetcher(config);
            RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
            RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
            try {
                controller = new CrawlerController(config, pageFetcher, robotstxtServer);
                controller.addSeed(name);
                controller.setSeeed(name);
                controller.setTimestamp(ts);
                controller.setSessiiid("admin" + num + scan);
                //Main.crawls.addCrawl("admin"+num+scan, new Crawl(name,"admin"+num+scan,ts));
                stmt.executeUpdate("DELETE FROM tempCrawl WHERE SessionID = '"
                        + "admin" + num + scan + "'");
                if (!scan) {
                    // Main.crawls.getCrawl("admin"+num+scan).setCrawl(true);
                    stmt.executeUpdate("INSERT INTO tempCrawl (SessionID, url, ts, done, crawledpages, misspelled, english, proper, scan, crawl)"
                            + " VALUES ('" + "admin" + num + scan + "', '" + name + "', '" + ts
                            + "', false, 0, 0, true, false, " + false + ", " + true + ")");
                } else {
                    // Main.crawls.getCrawl("admin"+num+scan).setScan(true);
                    stmt.executeUpdate("INSERT INTO tempCrawl (SessionID, url, ts, done, crawledpages, misspelled, english, proper, scan, crawl)"
                            + " VALUES ('" + "admin" + num + scan + "', '" + name + "', '" + ts
                            + "', false, 0, 0, true, false, " + true + ", " + false + ")");
                }
                connection.close();
                controller.start_auto(Crawler.class, numberOfCrawlers, false, scan, num);
            } catch (Exception e) {
                rs.close();
                connection.close();
                e.printStackTrace();
            }
        } else {
            rs.close();
            connection.close();
        }
        //CrawlerController.start_auto(scan, num);
        if (stopping) {
            stopped = true;
            stopping = false;
        }
    }
}
} catch (Exception e) {   // catch for an enclosing try block (elided above)
    e.printStackTrace();
}
As you can see, every iteration creates a crawlerController, crawls one website, and so on.
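Independently of the crawler itself, the loop above opens a JDBC Connection, Statement and ResultSet on every iteration but only closes some of them on some paths, so an exception or an early branch leaks them. A minimal sketch (not the original code, and using stub classes instead of a real database) of how try-with-resources guarantees cleanup on every path:

```java
import java.util.ArrayList;
import java.util.List;

public class ResourceDemo {
    // Records which resources were closed, in order.
    static List<String> closed = new ArrayList<>();

    // Stand-ins for java.sql.Connection / java.sql.Statement.
    static class StubConnection implements AutoCloseable {
        @Override public void close() { closed.add("connection"); }
    }
    static class StubStatement implements AutoCloseable {
        void executeQuery() { throw new RuntimeException("query failed"); }
        @Override public void close() { closed.add("statement"); }
    }

    public static void main(String[] args) {
        try (StubConnection connection = new StubConnection();
             StubStatement stmt = new StubStatement()) {
            stmt.executeQuery(); // throws, simulating a failing iteration
        } catch (RuntimeException e) {
            // both resources have already been closed at this point
        }
        // Resources close in reverse declaration order: statement, then connection.
        System.out.println(closed);
    }
}
```

Applied to the loop above, this removes the need to call close() by hand in every branch and catch block.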
The problem here is that the JVM heap size keeps growing. After profiling the application with YourKit Java Profiler, I traced the memory leak to the following line:

Environment env = new Environment(envHome, envConfig);

This is the exact line where the leak starts: the env variable seems to take up too much space and keeps growing after every operation, even though the operations are independent. I really don't know what this variable does or how to fix it. One more thing: I did change the source code of crawlController, which I think may be relevant.
Answer (score: 1):
I assume that you are using crawler4j as your crawling framework. Every time you create a crawl controller, a new frontier is instantiated, which is shared among the crawler threads to manage the queue of URLs to crawl. In addition, a so-called docIdServer is created, which is responsible for determining whether an incoming URL (i.e. website) has already been processed in this crawl.

This frontier and the docIdServer are based on an in-memory database (Berkeley DB Java Edition, which is where the Environment class in the line above comes from), and the environment is responsible for caching, locking, logging and transactions. For that reason, this variable will grow over time.
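To make the role of the docIdServer concrete, here is a simplified in-memory stand-in (a sketch only; crawler4j's real implementation is backed by a database, which is exactly why it occupies heap): it hands out a fresh id for a URL it has never seen and returns the existing id otherwise.

```java
import java.util.HashMap;
import java.util.Map;

public class DocIdServerSketch {
    private final Map<String, Integer> ids = new HashMap<>();
    private int nextId = 1;

    /** Returns the doc id for this URL, assigning a fresh one if it is new. */
    public int getOrCreateDocId(String url) {
        return ids.computeIfAbsent(url, u -> nextId++);
    }

    /** True if this URL has already been seen in this crawl. */
    public boolean isSeen(String url) {
        return ids.containsKey(url);
    }

    public static void main(String[] args) {
        DocIdServerSketch server = new DocIdServerSketch();
        System.out.println(server.getOrCreateDocId("http://example.com/"));      // new URL
        System.out.println(server.getOrCreateDocId("http://example.com/about")); // new URL
        System.out.println(server.getOrCreateDocId("http://example.com/"));      // already seen
        System.out.println(server.isSeen("http://example.com/"));
    }
}
```

Since this state is per-controller, creating a controller per website in a loop means a fresh frontier and docIdServer each time, and the leak is the accumulated backing storage of all of them.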
If you set resumable crawling to true, the database will operate in file mode and will grow slowly there.
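Note that the question's loop also builds a fresh crawlStorageFolder ("tmp/temp_storageadmin" + SessionId) per iteration, so in file mode those folders pile up on disk instead. One way to keep that bounded is to delete the folder once its controller has fully finished; a sketch using only the JDK (it assumes nothing still holds files open inside the folder):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class StorageCleanup {
    /** Recursively deletes a directory tree (children before parents). */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) return;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order sorts deeper paths first, so files go before their directories.
            walk.sorted(Comparator.reverseOrder())
                .forEach(p -> {
                    try { Files.delete(p); }
                    catch (IOException e) { throw new UncheckedIOException(e); }
                });
        }
    }

    public static void main(String[] args) throws IOException {
        // Simulate one crawl's storage folder, then remove it once the crawl is done.
        Path folder = Files.createTempDirectory("temp_storage");
        Files.writeString(folder.resolve("frontier.db"), "dummy");
        deleteRecursively(folder);
        System.out.println(Files.exists(folder)); // false
    }
}
```

In the original loop, this would run right after controller.start_auto(...) returns for a given website.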