Question

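I want to extract the header information from a web page using plain Java. For example, if the page is www.stackoverflow.com and the path is /questions, the program should return the HTTP header information from www.stackoverflow.com/questions. So far I have this method: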

private static String queryWeb(String page, String path) throws IOException {
        InetAddress requestedWebIP = InetAddress.getByName(page);
        if ((path == null) || (path.equals(""))) {
            path = "/";
        }
        try (
                Socket toWebSocket = new Socket(requestedWebIP, 80);
                BufferedOutputStream outPutStream = new BufferedOutputStream(toWebSocket.getOutputStream());
                BufferedReader inputStream = new BufferedReader(new InputStreamReader(toWebSocket.getInputStream()))
        ) {
            String request = "HEAD " + path + " HTTP/1.1\r\n\r\n";
            outPutStream.write(request.getBytes());
            outPutStream.flush();
            String input;
            String result = "";

            while (!(input = inputStream.readLine()).equals("")) {
                System.out.println(input);
                result = result + input + "\n";
            }

            return result;

        } catch (IOException e) {
            System.out.println("An error occurred during IO");
            e.printStackTrace();
        }
        return null;
    }

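This works for pages without an additional path, i.e. www.stackoverflow.com. However, whenever I try anything like www.stackoverflow.com/questions, I get a NullPointerException in the while loop. Stepping through with the debugger shows that inputStream is null, but only when a path is specified. So this works: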

HEAD / HTTP/1.1\r\n\r\n

But this does not (?):

HEAD /questions HTTP/1.1\r\n\r\n

So I assume the input stream is empty because the HEAD command failed, but why does it not accept this format?

Answer 1

You are missing the Host header:

A client MUST send a Host header field in all HTTP/1.1 request messages.

I have modified your code to send Host:

private static String queryWeb(String host, String path) throws IOException {
    InetAddress requestedWebIP = InetAddress.getByName(host);
    if ((path == null) || (path.equals(""))) {
        path = "/";
    }
    try (
            Socket toWebSocket = new Socket(requestedWebIP, 80);
            BufferedOutputStream outPutStream = new BufferedOutputStream(toWebSocket.getOutputStream());
            BufferedReader inputStream = new BufferedReader(new InputStreamReader(toWebSocket.getInputStream()))
    ) {
        String request = "HEAD " + path + " HTTP/1.1\r\n" +
                "Host: " + host + "\r\n\r\n";
        outPutStream.write(request.getBytes());
        outPutStream.flush();
        String input;
        String result = "";

        // readLine() returns null at end of stream, so check for that before comparing
        while ((input = inputStream.readLine()) != null && !input.isEmpty()) {
            System.out.println(input);
            result = result + input + "\n";
        }

        return result;

    } catch (IOException e) {
        System.out.println("An error occurred during IO");
        e.printStackTrace();
    }
    return null;
}

The following call

queryWeb("example.com", "/");

returns 200 OK, while

queryWeb("example.com", "/questions");

returns 404 Not Found (as expected).

www.stackoverflow.com works too (it returns a redirect to the https version).

None of these calls fails with the dreaded exception.
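Because queryWeb returns the raw header block as one string, the status codes mentioned above (200, 404, or a redirect) sit in its first line. Here is a minimal sketch for pulling the code out, assuming the string has the shape queryWeb produces. The names StatusLine and statusCode are made up for illustration and are not part of the code above:

```java
public class StatusLine {
    // Extracts the numeric status code from a raw header block such as
    // "HTTP/1.1 404 Not Found\nContent-Type: text/html\n".
    // Returns -1 if the input is null or does not start with a status line.
    static int statusCode(String rawHeaders) {
        if (rawHeaders == null) {
            return -1;
        }
        // Only the first line matters: "<version> <code> <reason phrase>".
        String firstLine = rawHeaders.split("\r?\n", 2)[0];
        String[] parts = firstLine.trim().split(" ", 3);
        if (parts.length < 2 || !parts[0].startsWith("HTTP/")) {
            return -1;
        }
        try {
            return Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(statusCode("HTTP/1.1 200 OK\nContent-Type: text/html\n"));
        System.out.println(statusCode("HTTP/1.1 404 Not Found\n"));
    }
}
```

The -1 return also covers the case where queryWeb returned null after an IOException, so a caller does not need a separate null check.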

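Please note:

The path must be %-escaped (I omitted this).
In general, it is easier (and safer) to use a library such as Apache HttpComponents HttpClient, google-http-client, etc. Even the standard URL().openConnection() saves a lot of tedious work and mistakes.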
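Both notes can be sketched with the standard library alone. This is a rough illustration, not a drop-in replacement: the names HeadRequest, escapePath and head are hypothetical, and escapePath assumes it is given an unescaped path (an existing "%" would be escaped again as "%25").

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.List;
import java.util.Map;

public class HeadRequest {
    // Percent-escapes characters that are illegal in a URI path (e.g. spaces).
    // The multi-argument URI constructor quotes such characters itself.
    static String escapePath(String path) throws URISyntaxException {
        return new URI(null, null, path, null).getRawPath();
    }

    // Sends a HEAD request through HttpURLConnection, which takes care of the
    // Host header and the response parsing. The status line is stored under
    // the null key of the returned map.
    static Map<String, List<String>> head(String host, String path)
            throws IOException, URISyntaxException {
        URL url = new URI("http", host, path, null).toURL();
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD");
        // Report the redirect itself instead of silently following it.
        conn.setInstanceFollowRedirects(false);
        try {
            return conn.getHeaderFields();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws Exception {
        // No network access needed for this part.
        System.out.println(escapePath("/my questions")); // prints /my%20questions
    }
}
```

With network access, a call such as head("example.com", "/questions") should yield the same header information as the socket version, without building the request by hand.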
