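How to get header information for a web path in Java

Date: 2018-03-19 18:12:15

Tags: java http

I want to extract header information from a web page using plain Java. For example, if the page is www.stackoverflow.com and the path is /questions, the program should return the HTTP header information from www.stackoverflow.com/questions. So far I have this method: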

private static String queryWeb(String page, String path) throws IOException {
        InetAddress requestedWebIP = InetAddress.getByName(page);
        if ((path == null) || (path.equals(""))) {
            path = "/";
        }
        try (
                Socket toWebSocket = new Socket(requestedWebIP, 80);
                BufferedOutputStream outPutStream = new BufferedOutputStream(toWebSocket.getOutputStream());
                BufferedReader inputStream = new BufferedReader(new InputStreamReader(toWebSocket.getInputStream()))
        ) {
            String request = "HEAD " + path + " HTTP/1.1\r\n\r\n";
            outPutStream.write(request.getBytes());
            outPutStream.flush();
            String input;
            String result = "";

            while (!(input = inputStream.readLine()).equals("")) {
                System.out.println(input);
                result = result + input + "\n";
            }

            return result;

        } catch (IOException e) {
            System.out.println("An error occurred during IO");
            e.printStackTrace();
        }
        return null;
    }

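This works for pages with no additional path, i.e. www.stackoverflow.com. However, whenever I try anything like www.stackoverflow.com/questions, I get a NullPointerException in the while loop. Stepping through with a debugger shows that inputStream is null, but only when a path is specified. So this works: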

HEAD / HTTP/1.1\r\n\r\n

But this does not (?):

HEAD /questions HTTP/1.1\r\n\r\n

So I assume the input stream is empty because the HEAD command failed, but why is this format not accepted?

1 Answer:

Answer 0: (score: 2)

You are missing the Host header:

  A client MUST send a Host header field in all HTTP/1.1 request messages.

I have modified your code to send Host:

private static String queryWeb(String host, String path) throws IOException {
    InetAddress requestedWebIP = InetAddress.getByName(host);
    if ((path == null) || (path.equals(""))) {
        path = "/";
    }
    try (
            Socket toWebSocket = new Socket(requestedWebIP, 80);
            BufferedOutputStream outPutStream = new BufferedOutputStream(toWebSocket.getOutputStream());
            BufferedReader inputStream = new BufferedReader(new InputStreamReader(toWebSocket.getInputStream()))
    ) {
        String request = "HEAD " + path + " HTTP/1.1\r\n" +
                "Host: " + host + "\r\n\r\n";
        outPutStream.write(request.getBytes());
        outPutStream.flush();
        String input;
        String result = "";

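        // readLine() returns null at end of stream, so check for that before testing for the blank line
        while ((input = inputStream.readLine()) != null && !input.isEmpty()) {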
            System.out.println(input);
            result = result + input + "\n";
        }

        return result;

    } catch (IOException e) {
        System.out.println("An error occurred during IO");
        e.printStackTrace();
    }
    return null;
}

The following call

queryWeb("example.com", "/");

returns 200 OK, while

queryWeb("example.com", "/questions");

returns 404 Not Found (as expected).

www.stackoverflow.com also works (it returns a redirect to the https version).

Neither call fails with the dreaded exception.
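The exception in the question comes from calling equals on the result of readLine(), which returns null once the stream has ended. Presumably the server closed the connection without sending a response to the Host-less request. A minimal sketch of a read loop that tolerates this is shown below; the class and method names (HeaderBlockReader, readHeaderBlock) are made up for illustration, and the response text in main is canned rather than fetched from a server.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original answer.
class HeaderBlockReader {

    // Collects lines until the blank line that ends the header block,
    // or until end of stream (readLine() returns null), whichever comes first.
    static List<String> readHeaderBlock(BufferedReader reader) throws IOException {
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null && !line.isEmpty()) {
            lines.add(line);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Canned response text standing in for a socket's input stream.
        String canned = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\nignored body";
        System.out.println(readHeaderBlock(new BufferedReader(new StringReader(canned))));

        // A stream that ends immediately yields an empty list instead of an exception.
        System.out.println(readHeaderBlock(new BufferedReader(new StringReader(""))));
    }
}
```

With a loop like this, an empty result signals that the server sent nothing, which the caller can report instead of crashing.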

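Note that:

  1. The path must be %-escaped (I omitted this).
  2. In general, it is easier (and safer) to use a library such as Apache HttpComponents HttpClient, google-http-client, etc. Even the standard URL().openConnection() saves a lot of tedious work and avoids many errors.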
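Both notes can be sketched with the standard library alone. The example below is only one possible approach, not the answerer's code. It assumes plain http on the default port, and the names HeadRequest and buildUri and the 5-second timeouts are choices made for illustration. The multi-argument java.net.URI constructor escapes characters that are illegal in a path, and HttpURLConnection sends the Host header itself.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.List;
import java.util.Map;

// Hypothetical class name; a sketch of the URL().openConnection() route.
class HeadRequest {

    // Builds an http:// URI. This constructor percent-escapes characters that are
    // illegal in a path (a space becomes %20). It also escapes '%' itself, so the
    // path passed in should not be escaped already.
    static URI buildUri(String host, String path) throws URISyntaxException {
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        return new URI("http", host, path, null);
    }

    static String queryWeb(String host, String path) throws IOException, URISyntaxException {
        HttpURLConnection conn =
                (HttpURLConnection) buildUri(host, path).toURL().openConnection();
        try {
            conn.setRequestMethod("HEAD");
            conn.setInstanceFollowRedirects(false); // report redirects rather than following them
            conn.setConnectTimeout(5000);           // assumed timeout values
            conn.setReadTimeout(5000);
            conn.getResponseCode();                 // sends the request; throws IOException on failure

            StringBuilder result = new StringBuilder();
            for (Map.Entry<String, List<String>> header : conn.getHeaderFields().entrySet()) {
                for (String value : header.getValue()) {
                    if (header.getKey() != null) {  // the status line is stored under a null key
                        result.append(header.getKey()).append(": ");
                    }
                    result.append(value).append("\n");
                }
            }
            return result.toString();
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling HeadRequest.queryWeb("example.com", "/questions") would then return the status line and headers as one string. The order of the header lines depends on the map returned by getHeaderFields().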