Question

I'm trying to download an HTML page from a URL:

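import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

// URL (the argument below) is presumably a String constant holding the page
// address, defined elsewhere in the class.
public static void getHtml(){
    URL url;
    InputStream is = null;
    BufferedReader br;
    String line;

    try {
        url = new URL(URL);
        is = url.openStream();
        br = new BufferedReader(new InputStreamReader(is));

        // Print the response body line by line.
        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }

    } catch (Exception e) {
        e.printStackTrace(); // report the failure instead of swallowing it
    } finally {
        try {
            if (is != null) is.close();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }
}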

The problem is that instead of the HTML I want, I get the following:

<html>
 <head>
  <title>loading</title>
 </head>
 <body>
  <p>Please wait...</p>
       <script>document.cookie="a=3c5hb1488cb3eghv3r456t12234jfyr7g;path=/;";location.href=document.location.pathname;</script>
 </body>
</html>

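How can I download the actual page content directly? I also tried jsoup, but it gives the same result. I tried Apache as well, with the same outcome.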

Answer 1

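Here is my guess about how this website behaves:

1. It returns this loading page to first-time visitors.
2. The browser runs the script, which sets a cookie and reloads (redirects to the same URL).
3. With the cookie present, the server replies with the real content.

So it works in a browser, but not in Java, because Java does not execute the script.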

You can parse the cookie-setting script and replay the cookie yourself: "a=3c5hb1488cb3eghv3r456t12234jfyr7g;path=/;"
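A minimal sketch of the parsing step, assuming the loading page always sets the cookie through a double-quoted `document.cookie="..."` assignment like the one in the question. The class name `CookieScriptParser` and the method `extractCookie` are hypothetical names chosen for illustration, and the regular expression would need adjusting if the site changes its script.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the cookie out of the "loading" page's script.
class CookieScriptParser {
    // Matches document.cookie="..." and captures the quoted value.
    private static final Pattern COOKIE_ASSIGNMENT =
            Pattern.compile("document\\.cookie\\s*=\\s*\"([^\"]*)\"");

    /**
     * Returns the name=value pair from the first document.cookie assignment.
     * Attributes such as path=/ are dropped, because only the name=value pair
     * is sent back to the server in a Cookie request header.
     */
    static Optional<String> extractCookie(String html) {
        Matcher m = COOKIE_ASSIGNMENT.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        String assigned = m.group(1);
        int semicolon = assigned.indexOf(';');
        String pair = semicolon >= 0 ? assigned.substring(0, semicolon) : assigned;
        return Optional.of(pair.trim());
    }
}
```

If the returned `Optional` is empty, the response was probably the real page already (or the site uses a different mechanism), so the caller can use the first response as it is.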

For how to set a cookie on a URL connection, see the post "URLConnection with Cookies?"
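As a rough illustration of the replay step, the sketch below sends a `Cookie` request header using the standard `HttpURLConnection` API. The class name `CookieFetcher` and the method `fetchWithCookie` are hypothetical, and decoding the body as UTF-8 is an assumption; the real site may use another charset or require further headers.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: repeats the request with the cookie the script would have set.
class CookieFetcher {
    /** Requests pageUrl with the given Cookie header and returns the response body. */
    static String fetchWithCookie(String pageUrl, String cookie) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        // The header value is the name=value pair only, without path or other attributes.
        conn.setRequestProperty("Cookie", cookie);
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) { // UTF-8 is assumed
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }
}
```

Usage would look like `CookieFetcher.fetchWithCookie(pageUrl, "a=3c5hb1488cb3eghv3r456t12234jfyr7g")`. The cookie value probably changes between visits, so it is safer to read it from the first response each time than to hard-code it.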

Or use Apache HTTP Client: http://hc.apache.org/httpclient-3.x/
