Question

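I'm writing a small application that scans all the pages reachable from a URL using depth-first search, so it has to make a lot of connections. After n pages I usually get a SocketTimeoutException and my application crashes. What is the best way to avoid this? Maybe increase the timeout, or something else? This is how I do it with recursion: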

public static ArrayList<String> getResponse(String url) throws IOException {
        ArrayList<String> resultList = new ArrayList<>();
        try {
            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a");
            int j = 0;

            for (int i = 0; i < links.size(); i++) {
                if (links.get(i).attr("abs:href").contains("http")) {
                    resultList.add(j, links.get(i).attr("abs:href"));
                    j++;
                }
            }
            return resultList;
        } catch (HttpStatusException e) {

            resultList.add(0, "");
            return resultList;
        } catch (SocketTimeoutException e) {
            getResponse(url);
        }
        return resultList;
    }

It should keep sending the request until there is no SocketTimeoutException. Am I right?

Answer 1

I would change the routine slightly:

public static ArrayList<String> getResponse(String url) throws IOException {
    return getResponse(url, 3);
} 

private static ArrayList<String> getResponse(String url, int retryCount) throws IOException {
    ArrayList<String> resultList = new ArrayList<>();
    if (retryCount <= 0){
        //fail gracefully
        resultList.add(0, "");
        return resultList;
    }
    retryCount--;
    try {
        Document doc = Jsoup.connect(url).timeout(10000).get();
        Elements links = doc.select("a");
        int j = 0;

        for (int i = 0; i < links.size(); i++) {
            if (links.get(i).attr("abs:href").contains("http")) {
                resultList.add(j, links.get(i).attr("abs:href"));
                j++;
            }
        }
        return resultList;
    } catch (HttpStatusException e) {

        resultList.add(0, "");
        return resultList;
    } catch (SocketTimeoutException e) {
        // Retry with the decremented counter and return the retry's result
        return getResponse(url, retryCount);
    }
}

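This sets the timeout for each connection to 10 seconds. A timeout of 0 (timeout(0)) waits forever. That is dangerous, however, because your routine may never finish; whether it is acceptable depends on how certain you are that the URLs can actually be reached.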

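The second mechanism avoids infinite recursion, which is probably why your program fails. Passing a counter along and retrying only while it is greater than 0 should solve the problem.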
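The same bounded-retry idea can also be written as a loop rather than recursion. The following is only a sketch under my own assumptions: the names Retry, IoTask and withRetries are hypothetical and not part of Jsoup or the JDK. It retries only on SocketTimeoutException and rethrows the last timeout once the attempts are used up.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

// Hypothetical helper, not part of Jsoup or the JDK.
final class Retry {

    /** A task that may fail with an IOException, e.g. a page fetch. */
    interface IoTask<T> {
        T run() throws IOException;
    }

    /**
     * Runs the task up to maxAttempts times. Only a SocketTimeoutException
     * triggers another attempt; any other IOException propagates immediately.
     * If every attempt times out, the last timeout is rethrown.
     */
    static <T> T withRetries(IoTask<T> task, int maxAttempts) throws IOException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException lastTimeout = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.run();
            } catch (SocketTimeoutException e) {
                lastTimeout = e; // remember the failure and try again
            }
        }
        throw lastTimeout;
    }
}
```

With this in place, the fetch above could be wrapped as Retry.withRetries(() -> Jsoup.connect(url).timeout(10000).get(), 3). The caller then decides what to do when all three attempts time out, so nothing is swallowed silently.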

Answer 2

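A few things look odd, without digging too deep. (a) What do you use "j" for? (b) It looks like you open a new socket for every request (Jsoup.connect(url)), but it doesn't look like you ever close it. Given the recursion, you could have a large number of sockets open at the same time, and the oldest ones are bound to time out and eventually close themselves. So as a first pass I would suggest: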

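Close every socket you use as soon as you are done with it.
Consider limiting the depth of the search somehow, so that you don't end up with thousands of open sockets. Most systems can't efficiently handle more than a few hundred open sockets at once.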
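To illustrate the depth limit, here is a rough sketch of a depth-first crawl that stops following links below a given depth and skips pages it has already seen. The names BoundedCrawler, crawl and fetchLinks are made up for this example; fetchLinks stands in for whatever returns the links of a page (for instance your getResponse), so the traversal can be tried without any network access.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of a depth-limited depth-first crawl.
final class BoundedCrawler {

    /** Returns the pages visited, in depth-first order, starting at depth 0. */
    static List<String> crawl(String start, int maxDepth,
                              Function<String, List<String>> fetchLinks) {
        List<String> visited = new ArrayList<>();
        visit(start, 0, maxDepth, fetchLinks, new HashSet<>(), visited);
        return visited;
    }

    private static void visit(String url, int depth, int maxDepth,
                              Function<String, List<String>> fetchLinks,
                              Set<String> seen, List<String> visited) {
        // Stop if the page is too deep or was already visited.
        if (depth > maxDepth || !seen.add(url)) {
            return;
        }
        visited.add(url);
        // Pages at the depth limit are recorded but their links are not fetched.
        if (depth == maxDepth) {
            return;
        }
        for (String link : fetchLinks.apply(url)) {
            visit(link, depth + 1, maxDepth, fetchLinks, seen, visited);
        }
    }
}
```

The recursion never goes deeper than maxDepth, and each page is fetched at most once, so the number of requests stays bounded even when pages link back to each other.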

I think you need to call "execute()" on the connection object to actually perform the "get()"; not sure whether that is related to your problem.

Which approach avoids SocketTimeoutException?
