Computer crash when I use recursion to check all HTML links and sublinks of a web site

Time: 2017-04-30 02:03:55

Tags: java html recursion web crash

I'm tasked with iterating over all links and sublinks of a given web portal. In most cases, when the web pages are not too complex or large, I don't have any problems. The problems start when I check the links of a really complex site such as tutorialspoint, and my computer just crashes. I can't find any performance issue in the code I attached, so can someone experienced tell me where the possible threat is in my code that makes my computer crash?

The uniqueLinks collection is a HashSet, for the best performance when using contains.

// imports used by this method:
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

private void recursiveLinkSearch(String webPage) {
    // TODO: ignore PDF links
    try {
        logger.info(webPage);
        uniqueLinks.add(webPage);
        Document doc = Jsoup.connect(webPage).get();
        doc.select("a").forEach(record -> {
            String url = record.absUrl("href");
            if (!uniqueLinks.contains(url)) {
                // Only follow links that stay on the portal's own domain.
                if (url.contains(getWebPortalDomain())) {
                    recursiveLinkSearch(url);
                }
            }
        });
    } catch (IOException e) {
        e.printStackTrace();
    }
}

1 Answer:

Answer 0 (score: 1)

I assume you don't literally mean that your computer crashed. I think you mean that your application crashed, presumably due to a StackOverflowError.

Recursion in Java has a fundamental limitation. If a thread recurses too deeply, it fills up its stack and you get a StackOverflowError. You can work around this (in some cases) by giving the thread a larger stack, but that only works up to a point.
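For reference, here is a minimal sketch of the larger-stack workaround, assuming it lives in the same class as the question's recursiveLinkSearch. The 16 MB value and the method name crawlWithBigStack are only illustrative; the JVM-wide equivalent is the -Xss flag (for example -Xss16m).

    // Runs the existing recursive crawl on a dedicated thread whose stack size
    // is requested explicitly (the value is a hint to the JVM, not a guarantee).
    public void crawlWithBigStack(String startPage) throws InterruptedException {
        Thread crawler = new Thread(
                null,                                   // default thread group
                () -> recursiveLinkSearch(startPage),   // the question's recursive method
                "crawler",
                16L * 1024 * 1024);                     // requested stack size in bytes
        crawler.start();
        crawler.join();
    }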

What you should do in a case like this is turn the recursive problem into an iterative one. For example:

  1. Use a data structure to hold a queue of URLs waiting to be processed.
  2. When you process a page and find links to other pages that need processing, add those links to the queue.
  3. A simple way to do this is to use an ExecutorService with a bounded worker pool; that also takes care of the queue management (see the sketch after this list).
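For illustration, here is a minimal sketch of the iterative approach, reusing Jsoup and a HashSet as in the question. The class name IterativeLinkCrawler, the constructor parameter webPortalDomain, and the crawl method are assumptions for this sketch, not part of the original code.

    import java.io.IOException;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public class IterativeLinkCrawler {

        private final Set<String> uniqueLinks = new HashSet<>();
        private final String webPortalDomain;

        public IterativeLinkCrawler(String webPortalDomain) {
            this.webPortalDomain = webPortalDomain;
        }

        // Breadth-first crawl driven by an explicit queue instead of the call
        // stack, so the depth of the link graph can no longer overflow a
        // thread's stack.
        public Set<String> crawl(String startPage) {
            Deque<String> pending = new ArrayDeque<>();
            pending.add(startPage);
            uniqueLinks.add(startPage);

            while (!pending.isEmpty()) {
                String webPage = pending.poll();
                try {
                    Document doc = Jsoup.connect(webPage).get();
                    doc.select("a").forEach(record -> {
                        String url = record.absUrl("href");
                        // HashSet.add returns true only for URLs not seen
                        // before, so every page is enqueued at most once.
                        if (url.contains(webPortalDomain) && uniqueLinks.add(url)) {
                            pending.add(url);
                        }
                    });
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            return uniqueLinks;
        }
    }

If you want to crawl concurrently, the same structure maps onto an ExecutorService with a bounded worker pool: each dequeued URL becomes a submitted task, and the executor's own work queue takes over the role of the ArrayDeque, as the answer suggests.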