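I want to extract header information from a web page using plain Java. For example, if the page is www.stackoverflow.com and the path is /questions, the program should return the HTTP header information from www.stackoverflow.com/questions. So far I have this method: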
    private static String queryWeb(String page, String path) throws IOException {
        InetAddress requestedWebIP = InetAddress.getByName(page);
        if ((path == null) || (path.equals(""))) {
            path = "/";
        }
        try (
            Socket toWebSocket = new Socket(requestedWebIP, 80);
            BufferedOutputStream outPutStream = new BufferedOutputStream(toWebSocket.getOutputStream());
            BufferedReader inputStream = new BufferedReader(new InputStreamReader(toWebSocket.getInputStream()))
        ) {
            String request = "HEAD " + path + " HTTP/1.1\r\n\r\n";
            outPutStream.write(request.getBytes());
            outPutStream.flush();
            String input;
            String result = "";
            while (!(input = inputStream.readLine()).equals("")) {
                System.out.println(input);
                result = result + input + "\n";
            }
            return result;
        } catch (IOException e) {
            System.out.println("An error occurred during IO");
            e.printStackTrace();
        }
        return null;
    }
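This works for web pages without an additional path, i.e. www.stackoverflow.com. However, whenever I try anything like www.stackoverflow.com/questions, I get a NullPointerException in the while loop. Stepping through with the debugger shows that inputStream is null, but only when a path is specified. So this works:
    HEAD / HTTP/1.1\r\n\r\n
But this does not (?):
    HEAD /questions HTTP/1.1\r\n\r\n
So I assume the input stream is empty because the HEAD command failed, but why does it not accept this format?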
Answer 0 (score: 2)
You are missing the Host header:
    A Host header field must be sent in all HTTP/1.1 request messages.
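To make the requirement concrete, here is a minimal sketch of how such a request could be assembled before it is written to the socket. The class HeadRequestSketch and the helper buildHeadRequest are hypothetical names used only for illustration; they are not part of the original code.

```java
import java.nio.charset.StandardCharsets;

class HeadRequestSketch {

    // Hypothetical helper: assembles a HEAD request that includes the Host header.
    static String buildHeadRequest(String host, String path) {
        // A missing or empty path is treated as the root resource.
        String target = (path == null || path.isEmpty()) ? "/" : path;
        return "HEAD " + target + " HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"
                + "\r\n"; // a blank line terminates the header section
    }

    public static void main(String[] args) {
        String request = buildHeadRequest("www.stackoverflow.com", "/questions");
        // These are the bytes that would be written to the socket's output stream.
        byte[] wire = request.getBytes(StandardCharsets.US_ASCII);
        System.out.print(request);
        System.out.println(wire.length + " bytes");
    }
}
```

The request line and each header end with \r\n, and the extra \r\n after the last header tells the server that the header section is complete.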
I have modified your code to send Host:
    // Imports needed by this method:
    import java.io.BufferedOutputStream;
    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.InetAddress;
    import java.net.Socket;

    private static String queryWeb(String host, String path) throws IOException {
        InetAddress requestedWebIP = InetAddress.getByName(host);
        if ((path == null) || (path.equals(""))) {
            path = "/";
        }
        try (
            Socket toWebSocket = new Socket(requestedWebIP, 80);
            BufferedOutputStream outPutStream = new BufferedOutputStream(toWebSocket.getOutputStream());
            BufferedReader inputStream = new BufferedReader(new InputStreamReader(toWebSocket.getInputStream()))
        ) {
            // HTTP/1.1 requires a Host header in every request.
            String request = "HEAD " + path + " HTTP/1.1\r\n" +
                             "Host: " + host + "\r\n\r\n";
            outPutStream.write(request.getBytes());
            outPutStream.flush();
            String input;
            String result = "";
            // readLine() returns null at end of stream, so check for that
            // before testing for the blank line that ends the headers.
            while ((input = inputStream.readLine()) != null && !input.equals("")) {
                System.out.println(input);
                result = result + input + "\n";
            }
            return result;
        } catch (IOException e) {
            System.out.println("An error occurred during IO");
            e.printStackTrace();
        }
        return null;
    }
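The following call
    queryWeb("example.com", "/");
returns 200 OK, while
    queryWeb("example.com", "/questions");
returns 404 Not Found (as expected).
www.stackoverflow.com also works (it returns a redirect to the https version).
Neither call fails with the dreaded exception.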
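If you want to check these status codes in code rather than by reading the printed output, something along the following lines could work. This is only a sketch: StatusLineSketch and parseStatusCode are hypothetical names, and it assumes the header block has the shape returned by queryWeb, with the status line first.

```java
class StatusLineSketch {

    // Hypothetical helper: extracts the numeric status code from a header block
    // such as the string returned by queryWeb. Returns -1 if none can be found.
    static int parseStatusCode(String headers) {
        if (headers == null || headers.isEmpty()) {
            return -1;
        }
        // The status line is the first line, e.g. "HTTP/1.1 404 Not Found".
        String statusLine = headers.split("\r?\n", 2)[0];
        String[] parts = statusLine.split(" ", 3);
        if (parts.length < 2 || !parts[0].startsWith("HTTP/")) {
            return -1;
        }
        try {
            return Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        String sample = "HTTP/1.1 404 Not Found\nContent-Type: text/html\n";
        System.out.println(parseStatusCode(sample)); // prints 404
    }
}
```

Returning -1 instead of throwing keeps the caller simple when queryWeb returned null or an empty string.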
Note that URL().openConnection() can also save you a lot of tedious work and avoid many errors.
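As a rough illustration of that approach, the sketch below sends a HEAD request through HttpURLConnection and prints the response headers. The names UrlConnectionHeadSketch, formatHeaders and queryHeaders are hypothetical, the 5-second timeouts are an arbitrary choice, and the treatment of the null key as the status line reflects how the standard JDK implementation behaves as far as I know.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.List;
import java.util.Map;

class UrlConnectionHeadSketch {

    // Formats a header map as one "Name: value" line per header.
    // HttpURLConnection stores the status line under a null key,
    // so that entry is printed without a name.
    static String formatHeaders(Map<String, List<String>> headers) {
        StringBuilder result = new StringBuilder();
        for (Map.Entry<String, List<String>> entry : headers.entrySet()) {
            String values = String.join(", ", entry.getValue());
            if (entry.getKey() == null) {
                result.append(values).append('\n');
            } else {
                result.append(entry.getKey()).append(": ").append(values).append('\n');
            }
        }
        return result.toString();
    }

    // Sends a HEAD request to the given absolute URL and returns the headers as text.
    static String queryHeaders(String address) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        try {
            connection.setRequestMethod("HEAD");
            // Do not follow redirects, so a 301 or 302 stays visible.
            connection.setInstanceFollowRedirects(false);
            connection.setConnectTimeout(5000);
            connection.setReadTimeout(5000);
            return formatHeaders(connection.getHeaderFields());
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String[] args) {
        try {
            System.out.print(queryHeaders("http://example.com/questions"));
        } catch (IOException e) {
            System.err.println("Request failed: " + e);
        }
    }
}
```

Here the library adds the Host header and parses the response itself, so there is no manual request string to get wrong.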