Question

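I am trying to crawl 300,000 URLs. However, somewhere in the middle, the code hangs while trying to retrieve the response code from a URL. I am not sure what is going wrong, since the connection is established and the problem only shows up after that. Any suggestions or pointers would be greatly appreciated. Also, is there a way to ping a website for a certain period of time and simply move on to the next one if it does not respond?

Following the suggestions, I modified the code to set the read timeout and the request property. However, even now the code is unable to get the response code!

Here is my modified code snippet: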

URL url=null;

try
{
    Thread.sleep(8000);
}
catch (InterruptedException e1)
{
    e1.printStackTrace();
}

try
{
    //urlToBeCrawled comes from the database
    url=new URL(urlToBeCrawled);
}
catch (MalformedURLException e)
{
    e.printStackTrace();
    //The code is in a loop, hence the use of continue. I apologize for putting code in the catch block.
    continue;
}
HttpURLConnection huc=null;
try
{
    huc = (HttpURLConnection)url.openConnection();

}
catch (IOException e)
{
    e.printStackTrace();
}
try
{
    //Added the request property
    huc.addRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
    huc.setRequestMethod("HEAD");

}
catch (ProtocolException e)
{
    e.printStackTrace();
}

huc.setConnectTimeout(1000);
try
{
    huc.connect();

}
catch (IOException e)
{

    e.printStackTrace();
    continue;
}

int responseCode=0;
try
{
    //Sets the read timeout
    huc.setReadTimeout(15000);
    //Code hangs here for some URL which is random in each run
    responseCode = huc.getResponseCode();

}
catch (IOException e)
{
    huc.disconnect();

    e.printStackTrace();
    continue;
}
if (responseCode!=200)
{
    huc.disconnect();
    continue;
}

Answer 1

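The server is holding the connection open but is not responding. It may even have detected that you are spidering their site, and a firewall or anti-DDoS tool may be intentionally trying to confuse you. Be sure to set a user agent (some servers get angry if you don't). Also, set a read timeout so that it gives up if nothing can be read after a while: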

huc.setReadTimeout(15000);
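
To make that concrete, here is a minimal sketch of a HEAD probe that sets both timeouts before any network I/O and treats a timeout as "skip this URL". The class name `HeadProbe`, the helper names, and the -1 sentinel are assumptions made for illustration, not part of the asker's code. The demo stands in for a misbehaving server with a local socket that accepts connections but never replies.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.ServerSocket;
import java.net.URL;

class HeadProbe {
    /** Returns the HTTP status code, or -1 if the URL is bad, unreachable, or times out. */
    static int probe(String address, int connectTimeoutMs, int readTimeoutMs) {
        HttpURLConnection huc = null;
        try {
            huc = (HttpURLConnection) new URL(address).openConnection();
            // Configure the request before anything is sent over the wire.
            huc.setConnectTimeout(connectTimeoutMs);
            huc.setReadTimeout(readTimeoutMs);
            huc.setRequestMethod("HEAD");
            huc.setRequestProperty("User-Agent",
                    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)");
            // getResponseCode() connects implicitly and waits for the status line.
            return huc.getResponseCode();
        } catch (IOException e) {
            // SocketTimeoutException and MalformedURLException are both IOExceptions.
            return -1;
        } finally {
            if (huc != null) {
                huc.disconnect();
            }
        }
    }

    /** Demo: probes a local socket that completes the TCP handshake but never answers. */
    static int probeSilentLocalServer() {
        try (ServerSocket silent = new ServerSocket(0)) {
            String address = "http://127.0.0.1:" + silent.getLocalPort() + "/";
            return probe(address, 1000, 500);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The read times out, so the probe reports the -1 sentinel instead of hanging.
        System.out.println(probeSilentLocalServer());
    }
}
```

In the crawl loop, a result of -1 (or anything other than 200) would simply mean `continue` to the next URL.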

Answer 2

This really should be done with multiple threads, especially if you are attempting 300,000 URLs. I prefer the thread-pool approach.
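
As a rough sketch of the thread-pool idea, using only `java.util.concurrent`: the class name `PooledCrawler`, the `checkAll` helper, and the stand-in check function in `main` are all made up for illustration. In a real crawl the function passed in would issue the HTTP request (with timeouts set) and return the status code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class PooledCrawler {
    /**
     * Runs check on every URL using a fixed-size pool and returns the results
     * in input order. A task that throws is recorded as -1.
     */
    static List<Integer> checkAll(List<String> urls, Function<String, Integer> check, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Integer>> tasks = new ArrayList<>();
            for (String u : urls) {
                tasks.add(() -> check.apply(u));
            }
            List<Integer> results = new ArrayList<>();
            // invokeAll blocks until every task has finished.
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                try {
                    results.add(f.get());
                } catch (ExecutionException e) {
                    results.add(-1);
                }
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while crawling", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder check: returns the URL length instead of a real status code.
        List<String> urls = Arrays.asList("http://a.example/", "http://bb.example/");
        System.out.println(checkAll(urls, String::length, 2));
    }
}
```

With a pool, one slow or silent server only ties up a single worker for the length of its timeout rather than stalling the whole crawl.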

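Secondly, you will benefit greatly from a more robust HTTP client such as the Apache Commons HTTP client, because it handles setting the user agent better. Most JREs will not let you modify the user agent through the HttpURLConnection class (they force it to your JDK version, e.g. Java/1.6.0_13 will be your user agent). There are tricks for changing this by adjusting a system property, but I have never seen it actually work. Again, just go with the Apache Commons HTTP library; you won't regret it.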

Finally, you need a good HTTP debugger to get to the bottom of this. You can use Fiddler2: just set up a Java proxy to point to Fiddler (scroll to the part about Java).
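
For reference, pointing Java at a local proxy can be done with the standard networking system properties. This is only a sketch: the class name `FiddlerProxy` is hypothetical, and 127.0.0.1:8888 assumes Fiddler is running on the same machine with its default port.

```java
class FiddlerProxy {
    /** Routes HTTP and HTTPS traffic from the JVM's default URL handlers through host:port. */
    static void enable(String host, int port) {
        String p = Integer.toString(port);
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", p);
        System.setProperty("https.proxyHost", host);
        System.setProperty("https.proxyPort", p);
    }

    public static void main(String[] args) {
        // Assumed Fiddler defaults; adjust if your setup listens elsewhere.
        enable("127.0.0.1", 8888);
        System.out.println(System.getProperty("http.proxyHost")
                + ":" + System.getProperty("http.proxyPort"));
    }
}
```

The same properties can be passed on the command line with `-Dhttp.proxyHost=127.0.0.1 -Dhttp.proxyPort=8888`. Inspecting HTTPS traffic will most likely also require Java to trust Fiddler's root certificate.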

Code hangs when trying to get the response code