Question

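I wrote a small piece of code that downloads 4chan pages. It fetches the raw HTML page and I parse it for my needs. The code below was working fine, but it suddenly stopped working. When I run it, the server does not accept my request; it seems to be waiting for something more. However, as far as I know, an HTTP request looks like this: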

GET /ck HTTP/1.1
Host: boards.4chan.org
(extra new line)

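If I change this format in any way, I get a "400 Bad Request" status code. But if I change HTTP/1.1 to 1.0, the server responds with a "200 OK" status and I get the whole page. This makes me think the error is in the Host line, since that header became mandatory in HTTP/1.1. Still, I can't figure out what exactly needs to be changed.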

Calling the function is as simple as this, to get a whole board:

downloadHTMLThread( "ck", -1);

Or, for a specific thread, just change -1 to that thread's number. For example, the link below corresponds to the call that follows it.

//http://boards.4chan.org/ck/res/3507158
//url.getDefaultPort() is 80
//url.getHost() is boards.4chan.org
//url.getFile() is /ck/res/3507158

downloadHTMLThread( "ck", 3507158);

Any advice would be appreciated, thanks.

public static final String BOARDS = "boards.4chan.org";
public static final String IMAGES = "images.4chan.org";
public static final String THUMBS = "thumbs.4chan.org";
public static final String RES = "/res/";
public static final String HTTP = "http://";
public static final String SLASH = "/";

public String downloadHTMLThread( String board, int thread) {
    BufferedReader reader = null;
    PrintWriter out = null;
    Socket socket = null;
    String str = null;
    StringBuilder input = new StringBuilder();

    try {
        URL url = new URL(HTTP+BOARDS+SLASH+board+(thread==-1?SLASH:RES+thread));
        socket = new Socket( url.getHost(), url.getDefaultPort());
        reader = new BufferedReader( new InputStreamReader( socket.getInputStream()));
        out = new PrintWriter(socket.getOutputStream(), true);

        out.println( "GET " +url.getFile()+ " HTTP/1.1");
        out.println( "HOST: " + url.getHost());
        out.println();

        long start = System.currentTimeMillis();
        while ((str = reader.readLine()) != null) {
            input.append( str).append("\r\n");
        }
        long end = System.currentTimeMillis();

        System.out.println( input);
        System.out.println( "\nTime: " +(end-start)+ " milliseconds");

    } catch (Exception ex) {
        ex.printStackTrace();
        input = null;
    } finally {
        if( reader!=null){
            try {
                reader.close();
            } catch (IOException ioe) {
                // nothing to see here
            }
        }
        if( socket!=null){
            try {
                socket.close();
            } catch (IOException ioe) {
                // nothing to see here
            }
        }
        if( out!=null){
            out.close();
        }
    }
    return input==null? null: input.toString();
}

Answer 1

Try using Apache HttpClient instead of rolling your own:

static String getUriContentsAsString(String uri) throws IOException {
  HttpClient client = new DefaultHttpClient();
  HttpResponse response = client.execute(new HttpGet(uri));
  return EntityUtils.toString(response.getEntity());
}

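If you are doing this to really learn the internals of HTTP client requests, you might start by playing with curl from the command line. That will let you get all your headers and the request body squared away. After that, it is a simple matter of adjusting your request to match what works in curl.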

Answer 2

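Looking at the code, I think you are sending 'HOST' instead of 'Host'. Since this header is mandatory in HTTP/1.1 but ignored in HTTP/1.0, that might be the problem. In any case, you could use a program to capture the packets being sent (e.g. Wireshark), just to make sure. Using println is convenient, but the line separator it appends depends on the system property line.separator. I think (although I'm not sure) that the line separator used in the HTTP protocol has to be '\r\n'. If you are capturing the packets, I think it's best to check that every line sent ends with '\r\n' (bytes 0x0D 0x0A), in case your OS line separator is different.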
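To make the line-ending point concrete, here is a minimal sketch that builds the request with explicit '\r\n' terminators rather than relying on println. The class name RawHttpRequest and its methods are hypothetical names chosen for illustration. The "Connection: close" header is an addition that the answer does not mention; it is included on the assumption that the caller reads until end of stream, which an HTTP/1.1 server may otherwise never signal because connections are persistent by default.

```java
import java.nio.charset.StandardCharsets;

public class RawHttpRequest {
    // HTTP uses CRLF as its line terminator, independent of the OS line separator.
    private static final String CRLF = "\r\n";

    // Builds a minimal HTTP/1.1 GET request with explicit CRLF line endings.
    public static String build(String host, String path) {
        return "GET " + path + " HTTP/1.1" + CRLF
             + "Host: " + host + CRLF
             // Assumption, not from the answer: ask the server to close the
             // connection after the response so a read-until-EOF loop can end.
             + "Connection: close" + CRLF
             + CRLF; // blank line terminates the header section
    }

    // Encodes the request as bytes suitable for writing to a socket's OutputStream.
    public static byte[] toBytes(String request) {
        return request.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String request = build("boards.4chan.org", "/ck/");
        // Make the otherwise invisible terminators visible for inspection.
        System.out.print(request.replace("\r\n", "\\r\\n\n"));
    }
}
```

With a socket already open, the bytes could be sent with socket.getOutputStream().write(RawHttpRequest.toBytes(request)) followed by a flush, which avoids println and its platform-dependent separator entirely.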

Answer 3

Please use www.4chan.org as the host. Since boards.4chan.org responds with a 302 redirect to www.4chan.org, you won't be able to scrape anything from boards.4chan.org.
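One way to check for such a redirect from the raw-socket code is to inspect the status line and the Location header of the response before treating the body as a page. The following is only a sketch: the class RedirectCheck and its methods are hypothetical, and the sample response in main is hand-written for illustration, not captured from the server.

```java
import java.util.Optional;

public class RedirectCheck {
    // Extracts the numeric status code from a status line such as "HTTP/1.1 302 Found".
    public static int statusCode(String responseHead) {
        String statusLine = responseHead.split("\r\n", 2)[0];
        String[] parts = statusLine.split(" ", 3);
        return Integer.parseInt(parts[1]);
    }

    // Returns the Location header value, if present in the header section.
    // Header names are matched case-insensitively.
    public static Optional<String> location(String responseHead) {
        String[] lines = responseHead.split("\r\n");
        for (int i = 1; i < lines.length; i++) {
            String line = lines[i];
            if (line.isEmpty()) {
                break; // blank line marks the end of the headers
            }
            int colon = line.indexOf(':');
            if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase("Location")) {
                return Optional.of(line.substring(colon + 1).trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hand-written sample response, used only to exercise the parser.
        String sample = "HTTP/1.1 302 Found\r\n"
                      + "Location: http://www.4chan.org/\r\n"
                      + "Content-Length: 0\r\n"
                      + "\r\n";
        int code = statusCode(sample);
        if (code >= 300 && code < 400) {
            System.out.println("Redirected to: " + location(sample).orElse("(no Location header)"));
        }
    }
}
```

If the status code is in the 3xx range, the caller would need to issue a new request to the host and path given in Location rather than parsing the body.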

HTTP GET request not working in Java when HTTP is 1.1?
