Question

I have a crawler that reads the HTML source of many web pages. I sometimes get the exception: Connection timed out: connect.

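I request many links that share the same beginning but have different endings. For example, I request links such as "https://stackoverflow.com/questions/ask" and "https://stackoverflow.com/tags", as well as other links with the same beginning, "https://stackoverflow.com/...", and so on.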

I have tried two methods for reading many page sources, but both are unreliable.

Method 1:

private static String getUrlSource(String link) throws IOException {
    System.setProperty("java.net.useSystemProxies", "true");
    URL url = new URL(link);
    URLConnection connection = url.openConnection();
    String redirect = connection.getHeaderField("Location");
    if (redirect != null) {
        connection = new URL(redirect).openConnection();
        connection.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.4; en-US; rv:1.9.2.2) Gecko/20100316 Firefox/3.6.2");
        connection.setUseCaches(false);
        connection.setReadTimeout(30000);
        connection.setDoOutput(true);
        connection.setRequestProperty("Content-Type", "application/x-java-serialized-object");
    }
    BufferedReader in = new BufferedReader(new InputStreamReader(connection.getInputStream(), "UTF-8"));
    StringBuilder text = new StringBuilder();
    String inputLine;
    System.out.println();
    while ((inputLine = in.readLine()) != null) {
        text.append(inputLine);
        // System.out.println(inputLine);
    }
    in.close();
    return text.toString();
}

Method 2:

 String sourceCode = Jsoup.connect(url).timeout(0).userAgent("Mozilla").get().html();

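I get the same error with it, but less often. Does anyone know why I get this error and how to fix it? Interestingly, when I use Method 2 to request fewer links, the problem usually goes away. I also tried turning off the firewall, but that did not help.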
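Since the error mostly goes away when fewer links are requested, one thing that could be tried is an explicit connect timeout combined with a retry that waits longer after each failure. The following is only a minimal sketch under that assumption, not a confirmed fix. The names RetryingFetcher, PageSource, fetchWithRetry and download, and the timeout and delay values, are hypothetical and not part of the code above. Note that Method 1 sets only a read timeout and never a connect timeout.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

final class RetryingFetcher {

    /** Hypothetical abstraction over "download one page", so the retry logic can be tested offline. */
    @FunctionalInterface
    interface PageSource {
        String fetch(String link) throws IOException;
    }

    /**
     * Calls source.fetch(link) up to maxAttempts times. After each failed attempt
     * it sleeps, doubling the delay every time, and rethrows the last IOException
     * if every attempt fails.
     */
    static String fetchWithRetry(PageSource source, String link,
                                 int maxAttempts, long initialDelayMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return source.fetch(link);
            } catch (IOException e) {
                // Connect timeouts surface as IOException subclasses
                // (ConnectException or SocketTimeoutException).
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2; // back off so the server sees fewer requests per second
                }
            }
        }
        throw last;
    }

    /** One possible PageSource: a plain GET with explicit connect and read timeouts (values are assumptions). */
    static String download(String link) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(link).toURL().openConnection();
        connection.setConnectTimeout(10_000); // fail after 10 s instead of waiting for the OS default
        connection.setReadTimeout(30_000);
        connection.setRequestProperty("User-Agent", "Mozilla/5.0");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            StringBuilder text = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
            return text.toString();
        } finally {
            connection.disconnect();
        }
    }
}
```

It could be called as RetryingFetcher.fetchWithRetry(RetryingFetcher::download, link, 4, 1000L). If the timeouts disappear with this in place, that would suggest the request rate is involved; if they do not, the cause is probably elsewhere.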

Connection timed out: connect (Java)

0 answers: