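I am building a crawler and need to get the data from the stream regardless of whether the response is a 200 or not. cURL does this, as does any standard browser.
The following does not actually fetch the content of the request: even when the response has a body, an exception is thrown for the HTTP error status code. I want the output regardless. Is there a way? I would prefer to keep using this library because it actually does persistent connections, which is perfect for the kind of crawling I am doing.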
package test;

import java.net.*;
import java.io.*;

public class Test {
    public static void main(String[] args) {
        try {
            URL url = new URL("http://github.com/XXXXXXXXXXXXXX");
            URLConnection connection = url.openConnection();
            DataInputStream inStream = new DataInputStream(connection.getInputStream());
            String inputLine;
            while ((inputLine = inStream.readLine()) != null) {
                System.out.println(inputLine);
            }
            inStream.close();
        } catch (MalformedURLException me) {
            System.err.println("MalformedURLException: " + me);
        } catch (IOException ioe) {
            System.err.println("IOException: " + ioe);
        }
    }
}
That worked, thanks. Here is what I came up with, as a rough proof of concept:
import java.net.*;
import java.io.*;

public class Test {
    public static void main(String[] args) {
        //InputStream error = ((HttpURLConnection) connection).getErrorStream();
        URL url = null;
        URLConnection connection = null;
        String inputLine = "";
        try {
            url = new URL("http://verelo.com/asdfrwdfgdg");
            connection = url.openConnection();
            DataInputStream inStream = new DataInputStream(connection.getInputStream());
            while ((inputLine = inStream.readLine()) != null) {
                System.out.println(inputLine);
            }
            inStream.close();
        } catch (MalformedURLException me) {
            System.err.println("MalformedURLException: " + me);
        } catch (IOException ioe) {
            System.err.println("IOException: " + ioe);
            InputStream error = ((HttpURLConnection) connection).getErrorStream();
            try {
                int data = error.read();
                while (data != -1) {
                    //do something with data...
                    //System.out.println(data);
                    inputLine = inputLine + (char)data;
                    data = error.read();
                    //inputLine = inputLine + (char)data;
                }
                error.close();
            } catch (Exception ex) {
                try {
                    if (error != null) {
                        error.close();
                    }
                } catch (Exception e) {
                }
            }
        }
        System.out.println(inputLine);
    }
}
Answer 0 (score: 42)
Simple:
URLConnection connection = url.openConnection();
// Caveat: on an HttpURLConnection, getInputStream() itself throws an
// IOException when the status code is 400 or above.
InputStream is = connection.getInputStream();
if (connection instanceof HttpURLConnection) {
    HttpURLConnection httpConn = (HttpURLConnection) connection;
    int statusCode = httpConn.getResponseCode();
    if (statusCode != 200 /* or statusCode >= 200 && statusCode < 300 */) {
        is = httpConn.getErrorStream();
    }
}
You can refer to the Javadoc for an explanation. The best way I would handle this is as follows:
URLConnection connection = url.openConnection();
InputStream is = null;
try {
    is = connection.getInputStream();
} catch (IOException ioe) {
    if (connection instanceof HttpURLConnection) {
        HttpURLConnection httpConn = (HttpURLConnection) connection;
        int statusCode = httpConn.getResponseCode();
        if (statusCode != 200) {
            is = httpConn.getErrorStream();
        }
    }
}
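For completeness, here is a minimal sketch of the same try/catch idea wrapped in a helper that also drains and closes whichever stream it ends up with. The class name BodyFetcher, the method readBody, and the UTF-8 decoding are assumptions made for illustration, not part of the answer above. Reading the body to the end and closing the stream (the error stream included) is generally what lets HttpURLConnection reuse a persistent connection.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are illustrative only.
final class BodyFetcher {

    /** Reads the response body whether or not the request succeeded. */
    static String readBody(HttpURLConnection conn) throws IOException {
        InputStream in;
        try {
            in = conn.getInputStream();   // throws IOException for status >= 400
        } catch (IOException ioe) {
            in = conn.getErrorStream();   // null if the server sent no error body
            if (in == null) {
                throw ioe;                // nothing to read, so report the failure
            }
        }
        // try-with-resources closes the stream even if reading fails.
        try (InputStream body = in) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = body.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            // Assumption: UTF-8. A real crawler would honor the Content-Type charset.
            return new String(buf.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java BodyFetcher <http-url>");
            return;
        }
        // Assumption: the URL is http or https, so the cast is safe.
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(args[0]).toURL().openConnection();
        String body = readBody(conn);
        System.out.println(conn.getResponseCode() + ": " + body.length() + " chars");
    }
}
```

Rethrowing the original exception when getErrorStream() returns null keeps genuine network failures (no response at all) distinct from error responses that carry a body.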
Answer 1 (score: 10)
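After calling openConnection, you need to do the following:
1. Cast the URLConnection to HttpURLConnection.
2. Call getResponseCode.
3. If the response is a success, use getInputStream; otherwise use getErrorStream.
(The test for success should be 200 <= code < 300, because there are valid HTTP success codes other than 200.)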
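The steps above can be sketched roughly as follows. The class name StreamChooser and its two methods are hypothetical names chosen for this example; only the HttpURLConnection calls come from the JDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URLConnection;

// Hypothetical helper illustrating the three steps; names are illustrative.
final class StreamChooser {

    /** True for any 2xx status, not just 200. */
    static boolean isSuccess(int code) {
        return code >= 200 && code < 300;
    }

    /** Picks the body stream by status code. May return null if there is no error body. */
    static InputStream bodyStream(URLConnection connection) throws IOException {
        if (!(connection instanceof HttpURLConnection)) {
            // Non-HTTP URLs (for example file:) have no status code to check.
            return connection.getInputStream();
        }
        HttpURLConnection http = (HttpURLConnection) connection;  // step 1: cast
        int code = http.getResponseCode();                         // step 2: status
        return isSuccess(code)                                     // step 3: choose
                ? http.getInputStream()
                : http.getErrorStream();
    }
}
```

Callers should check the result for null: getErrorStream() is documented to return null when the server sent no error data, which can also happen for a non-2xx response that is not an error, such as a redirect that was not followed.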
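"I am building a crawler and need to get the data from the stream regardless of whether the response is a 200 or not."
Note that if the code is a 4xx or 5xx, the "data" is likely to be an error page of some kind.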
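One way a crawler might act on this is to keep the status code next to the body, so later stages can tell real content from an error page. The FetchResult class below is a hypothetical sketch, not something from the question or from the JDK.

```java
// Hypothetical value class pairing a status code with the body that came with it.
final class FetchResult {
    final int status;
    final String body;

    FetchResult(int status, String body) {
        this.status = status;
        this.body = body;
    }

    /** True when the body most likely is an error page (4xx or 5xx status). */
    boolean isErrorPage() {
        return status >= 400 && status <= 599;
    }
}
```

A crawler could then store or index a result only when isErrorPage() returns false, while still logging the error body for diagnosis.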
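One final point: you should always respect the "robots.txt" file, and take care before crawling or scraping the content of a site whose owners might care. Simply firing off GET requests is liable to annoy site owners, unless you have already come to some kind of "arrangement" with them.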