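How can I read the HTML content of this page with Java?

Time: 2016-11-21 17:54:55

Tags: java html url

My Java application tries to read the content at the following URL: https://www.iplocation.net/?query=62.92.63.48

I used the following method: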

  StringBuffer readFromUrl(String Url)
  {
    StringBuffer sb=new StringBuffer();
    BufferedReader in=null;

    try
    {
      in=new BufferedReader(new InputStreamReader(new URL(Url).openStream()));
      String inputLine;

      while ((inputLine=in.readLine()) != null) sb.append(inputLine+"\n");
      in.close();
    }
    catch (Exception e) { e.printStackTrace(); }
    finally 
    {
      try 
      {
        if (in!=null)
        {
          in.close();
          in=null;
        }
      }
      catch (Exception ex) { ex.printStackTrace(); }
    }
    return sb;
  }

It usually works for other URLs, but for this one the result differs from what the browser displays. It looks like this:

<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script>
(function(){function getSessionCookies(){var cookieArray=new Array();var cName=/^\s?incap_ses_/;var c=document.cookie.split(";");for(var i=0;i<c.length;i++){var key=c[i].substr(0,c[i].indexOf("="));var value=c[i].substr(c[i].indexOf("=")+1,c[i].length);if(cName.test(key)){cookieArray[cookieArray.length]=value}}return cookieArray}function setIncapCookie(vArray){var res;try{var cookies=getSessionCookies();var digests=new Array(cookies.length);for(var i=0;i<cookies.length;i++){digests[i]=simpleDigest((vArray)+cookies[i])}res=vArray+",digest="+(digests.join())}catch(e){res=vArray+",digest="+(encodeURIComponent(e.toString()))}createCookie("___utmvc",res,20)}function simpleDigest(mystr){var res=0;for(var i=0;i<mystr.length;i++){res+=mystr.charCodeAt(i)}return res}function createCookie(name,value,seconds){var expires="";if(seconds){var date=new Date();date.setTime(date.getTime()+(seconds*1000));var expires="; expires="+date.toGMTString()}document.cookie=name+"="+value+expires+"; path=/"}function test(o){var res="";var vArray=new Array();for(var j=0;j<o.length;j++){var test=o[j][0];switch(o[j][1]){case"exists":try{if(typeof(eval(test))!="undefined"){vArray[vArray.length]=encodeURIComponent(test+"=true")}else{vArray[vArray.length]=encodeURIComponent(test+"=false")}}catch(e){vArray[vArray.length]=encodeURIComponent(test+"=false")}break;case"value":try{try{res=eval(test);if(typeof(res)==="undefined"){vArray[vArray.length]=encodeURIComponent(test+"=undefined")}else if(res===null){vArray[vArray.length]=encodeURIComponent(test+"=null")}else{vArray[vArray.length]=encodeURIComponent(test+"="+res.toString())}}catch(e){vArray[vArray.length]=encodeURIComponent(test+"=cannot evaluate");break}break}catch(e){vArray[vArray.length]=encodeURIComponent(test+"="+e)}case"plugin_extentions":try{var extentions=[];try{i=extentions.indexOf("i")}catch(e){vArray[vArray.length]=encodeURIComponent("plugin_ext=indexOf is not a function");break}try{var num=navigator.plugins.length 
if(num==0||num==null){vArray[vArray.length]=encodeURIComponent("plugin_ext=no plugins");break}}catch(e){vArray[vArray.length]=encodeURIComponent("plugin_ext=cannot evaluate");break}for(var i=0;i<navigator.plugins.length;i++){if(typeof(navigator.plugins[i])=="undefined"){vArray[vArray.length]=encodeURIComponent("plugin_ext=plugins[i] is undefined");break}var filename=navigator.plugins[i].filename var ext="no extention";if(typeof(filename)=="undefined"){ext="filename is undefined"}else if(filename.split(".").length>1){ext=filename.split('.').pop()}if(extentions.indexOf(ext)<0){extentions.push(ext)}}for(i=0;i<extentions.length;i++){vArray[vArray.length]=encodeURIComponent("plugin_ext="+extentions[i])}}catch(e){vArray[vArray.length]=encodeURIComponent("plugin_ext="+e)}break}}vArray=vArray.join();return vArray}var o=[["navigator","exists"],["navigator.vendor","value"],["navigator.appName","value"],["navigator.plugins.length==0","value"],["navigator.platform","value"],["navigator.webdriver","value"],["platform","plugin_extentions"],["ActiveXObject","exists"],["webkitURL","exists"],["_phantom","exists"],["callPhantom","exists"],["chrome","exists"],["yandex","exists"],["opera","exists"],["opr","exists"],["safari","exists"],["awesomium","exists"],["puffinDevice","exists"],["navigator.cpuClass","exists"],["navigator.oscpu","exists"],["navigator.connection","exists"],["window.outerWidth==0","value"],["window.outerHeight==0","value"],["window.WebGLRenderingContext","exists"],["document.documentMode","value"],["eval.toString().length","value"]];try{setIncapCookie(test(o));document.createElement("img").src="/_Incapsula_Resource?SWKMTFSR=1&e="+Math.random()}catch(e){img=document.createElement("img");img.src="/_Incapsula_Resource?SWKMTFSR=1&e="+e}})();
</script>
<script>
(function() { 
var z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B766172207374617475733D2273746128......6F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval('String.fromCharCode('+z+')'));})();
</script></head>
<body>
<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe>
</body></html>

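So what is the correct way, in this case, to read the HTML content as the browser displays it?

Edit: After reading the suggestions, I updated my program as follows: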

StringBuilder response=new StringBuilder();
String USER_AGENT="Mozilla/5.0",inputLine;
BufferedReader in=null;    

try
{
  HttpURLConnection con=(HttpURLConnection)new URL(Url).openConnection();
  con.setRequestMethod("GET");
  con.setRequestProperty("Accept-Charset","UTF-8");
  con.setRequestProperty("User-Agent",USER_AGENT);                         // Add request header

  int responseCode=con.getResponseCode();
  in=new BufferedReader(new InputStreamReader(con.getInputStream()));
  while ((inputLine=in.readLine())!=null) { response.append(inputLine); }
  in.close();
}
catch (Exception e) { e.printStackTrace(); }
finally 
{
  try { if (in!=null) in.close(); }
  catch (Exception ex) { ex.printStackTrace(); }
}
return response.toString();

But it still did not work. The response I get looks like this:

<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe src="/_Incapsula_Resource?CWUDNSAI=24&xinfo=8-75933493-0 0NNN RT(1479758027223 127) q(0 -1 -1 -1) r(0 -1) B12(4,315,0) U10000&incident_id=516000100118713619-514529209419563176&edet=12&cinfo=04000000" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 516000100118713619-514529209419563176</iframe></body></html>

Can someone show some working sample code?

Thanks to @thatguy, I modified my program as follows:

import java.util.*;
import java.util.concurrent.*;
import java.io.*;
import java.net.*;
import java.util.Map.Entry;

public class Read_From_Url_Runner implements Callable<String[]>
{
  int Id;
  String Read_From_Url_Result[]=null,IP_Location_Url="https://www.iplocation.net/?query=[IP]",IP="62.92.63.48",Cookie,Result[],A_Url;

  public Read_From_Url_Runner(int Id)
  {
    this.Id=Id;

    A_Url=IP_Location_Url.replace("[IP]",IP);
    Cookie=getIncapsulaCookie(A_Url);
    Out("Cookie = [ "+Cookie+" ]");

    try
    {
      Result=call();
//      for (int i=0;i<Result.length;i++) Out(Result[i]);
    }
    catch (Exception e) { e.printStackTrace(); }
  }

  public String[] call() throws InterruptedException
  {
    String Text;

    try
    {
      Text=readUrl(A_Url,Cookie);
      Out(Text);
    }
    catch (Exception e)
    {
      Out(" --> Error in data : IP = "+IP);
//    e.printStackTrace();
    }
    return Read_From_Url_Result;
  }

  public static String readUrl(String url,String incapsulaCookie)
  {
    StringBuilder response=new StringBuilder();
    String USER_AGENT="Mozilla/5.0",inputLine;
    BufferedReader in=null;

    try
    {
      HttpURLConnection connection=(HttpURLConnection)new URL(url).openConnection();
      connection.setRequestMethod("GET");
      connection.setRequestProperty("Accept","text/html; charset=UTF-8");
      connection.setRequestProperty("User-Agent",USER_AGENT);
      connection.setDoInput(true);
      connection.setDoOutput(true);
      connection.setRequestProperty("Cookie",incapsulaCookie);                           // Set the Incapsula cookie
      Out(connection.getRequestProperty("Cookie"));

      in=new BufferedReader(new InputStreamReader(connection.getInputStream()));
      while ((inputLine=in.readLine())!=null) { response.append(inputLine+"\n"); }
      in.close();
    }
    catch (Exception e) { e.printStackTrace(); }
    finally
    {
      try { if (in!=null) in.close(); }
      catch (Exception ex) { ex.printStackTrace(); }
    }
    return response.toString();
  }

  public static String getIncapsulaCookie(String url)
  {
    String USER_AGENT="Mozilla/5.0",incapsulaCookie=null,visid=null,incap=null;          // Cookies for Incapsula, preserve order
    BufferedReader in=null;

    try
    {
      HttpURLConnection cookieConnection=(HttpURLConnection)new URL(url).openConnection();
      cookieConnection.setRequestMethod("GET");
      cookieConnection.setRequestProperty("Accept","text/html; charset=UTF-8");
      cookieConnection.setRequestProperty("User-Agent",USER_AGENT);
      cookieConnection.connect();

      for (Entry<String,List<String>> header : cookieConnection.getHeaderFields().entrySet())
      {
        if (header.getKey()!=null && header.getKey().equals("Set-Cookie"))               // Incapsula gives you the required cookies
        {
          for (String cookieValue : header.getValue())                                   // Search for the desired cookies
          {
            if (cookieValue.contains("visid")) visid=cookieValue.substring(0,cookieValue.indexOf(";")+1);
            if (cookieValue.contains("incap_ses")) incap=cookieValue.substring(0,cookieValue.indexOf(";"));
          }
        }
      }
      incapsulaCookie=visid+" "+incap;
      cookieConnection.disconnect();
    }
    catch (Exception e) { e.printStackTrace(); }
    finally
    {
      try { if (in!=null) in.close(); }
      catch (Exception ex) { ex.printStackTrace(); }
    }
    return incapsulaCookie;
  }

  private static void out(String message) { System.out.print(message); }
  private static void Out(String message) { System.out.println(message); }

  public static void main(String[] args)
  {
    final Read_From_Url_Runner demo=new Read_From_Url_Runner(0);
  }
}

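But this returns only the first part of the response, as shown here:

[image: partial response]

What I actually want is the following:

[image: desired result]

I obtained this result by running my program from "How to shut down Javafx?".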

1 answer:

Answer 0 (score: 3)

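The problem you are facing is most likely caused by the HTTP request headers, which you do not set explicitly. Websites are often served in different representations depending on properties in the HTTP headers (and payload), for example to serve desktop or mobile clients appropriately. Your code sets nothing, so you send whatever default headers the library chooses. If you inspect the concrete HTTP headers your browser sends, you will very likely find differences (such as the user agent or the encoding). If you reproduce those headers in your code, the result should be the same.

In addition, you can use HttpUrlConnection, which lets you set or read the corresponding HTTP headers easily, for example as in this SO post. Otherwise, for URLConnection, see here.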
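As an illustration of the advice above, here is a minimal sketch of applying a browser-like header set to an HttpURLConnection and reading it back. The class name BrowserLikeHeaders, the helper names, and the concrete header values are assumptions made for this example; the values you actually need should be copied from your own browser's developer tools.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

public class BrowserLikeHeaders {

    // Hypothetical header values for illustration only; replace them with
    // the ones your browser really sends.
    public static Map<String, String> browserLikeHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0");
        headers.put("Accept", "text/html,application/xhtml+xml");
        headers.put("Accept-Language", "en-US,en;q=0.8");
        return headers;
    }

    // Creates a GET connection with the headers applied.
    // openConnection() does not send anything yet; the request goes out
    // only when the response is read (e.g. getResponseCode()).
    public static HttpURLConnection open(String url, Map<String, String> headers) {
        try {
            HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
            con.setRequestMethod("GET");
            for (Map.Entry<String, String> h : headers.entrySet()) {
                con.setRequestProperty(h.getKey(), h.getValue());
            }
            return con;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection con =
                open("https://www.iplocation.net/?query=62.92.63.48", browserLikeHeaders());
        // Show what would be sent, without contacting the server.
        System.out.println(con.getRequestProperties());
    }
}
```

Calling con.getResponseCode() or con.getHeaderFields() on the returned connection would then send the request and let you compare the server's response headers with what the browser receives.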

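Further investigation

Your approach repeatedly returns a special error page, which indicates that the website uses additional security features from Incapsula. The page you get looks like this:

[image: Incapsula error page]

When I investigated the headers, I noticed that two cookie strings must be present for you to reach the website directly instead of the security check: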

visid_incap_...=...
incap_ses_..._...=...

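What you can do is hit the error page with a single request, which gives you both cookie strings in the Set-Cookie headers. You can then request the website directly with the Cookie header set to visid_incap_...=...; incap_ses_..._...=.... You can repeat the request until the cookies expire, which you can detect simply by checking for the error page. Here is working code. It obviously lacks style and additional checks, but it solves your problem. The rest is up to you.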

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map.Entry;

public static String getIncapsulaCookie(String url) {

    String USER_AGENT = "Mozilla/5.0";
    String incapsulaCookie = null;
    HttpURLConnection cookieConnection = null;

    try {

        cookieConnection = (HttpURLConnection) new URL(url).openConnection();
        cookieConnection.setRequestMethod("GET");
        cookieConnection.setRequestProperty("Accept",
                "text/html; charset=UTF-8");
        cookieConnection.setRequestProperty("User-Agent", USER_AGENT);

        // Disable 'keep-alive'
        cookieConnection.setRequestProperty("Connection", "close");

        // Cookies for Incapsula, preserve order
        String visid = null;
        String incap = null;

        cookieConnection.connect();

        for (Entry<String, List<String>> header : cookieConnection
                .getHeaderFields().entrySet()) {

            // Incapsula gives you the required cookies
            if (header.getKey() != null
                    && header.getKey().equalsIgnoreCase("Set-Cookie")) {

                // Search for the desired cookies
                for (String cookieValue : header.getValue()) {
                    // Keep only the leading name=value pair, dropping attributes
                    int end = cookieValue.indexOf(';');
                    String pair = end >= 0
                            ? cookieValue.substring(0, end) : cookieValue;
                    if (cookieValue.contains("visid")) {
                        visid = pair + ";";
                    }
                    if (cookieValue.contains("incap_ses")) {
                        incap = pair;
                    }
                }
            }
        }

        // Build the Cookie header only if both cookies were found
        if (visid != null && incap != null) {
            incapsulaCookie = visid + " " + incap;
        }

    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        // Explicitly disconnect, also essential in this method!
        if (cookieConnection != null) {
            cookieConnection.disconnect();
        }
    }

    return incapsulaCookie;

}
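The cookie-extraction step can also be isolated as a pure function, which makes it easy to try out without any network access. The following is only a sketch: the class name IncapsulaCookieHeader, the method name toCookieHeader, and the sample cookie names and values are all made up for illustration, and it assumes the two cookies keep the visid_incap_ and incap_ses_ prefixes shown above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class IncapsulaCookieHeader {

    // Builds a Cookie request header value from raw Set-Cookie values.
    // Returns Optional.empty() unless both required cookies are present.
    public static Optional<String> toCookieHeader(List<String> setCookieValues) {
        String visid = null;
        String session = null;
        for (String raw : setCookieValues) {
            // Keep only the leading name=value pair, dropping attributes
            // such as path, domain and expires.
            int end = raw.indexOf(';');
            String pair = (end >= 0 ? raw.substring(0, end) : raw).trim();
            if (pair.startsWith("visid_incap_")) {
                visid = pair;
            } else if (pair.startsWith("incap_ses_")) {
                session = pair;
            }
        }
        if (visid == null || session == null) {
            return Optional.empty();
        }
        // Same order as in the answer: visid first, then the session cookie.
        return Optional.of(visid + "; " + session);
    }

    public static void main(String[] args) {
        // Hypothetical Set-Cookie values, not real Incapsula output.
        List<String> sample = Arrays.asList(
                "visid_incap_000000=AAAA; path=/; Domain=.example.com",
                "other_cookie=BBBB; path=/",
                "incap_ses_111_000000=CCCC; path=/; Domain=.example.com");
        System.out.println(toCookieHeader(sample).orElse("<cookies missing>"));
        // prints: visid_incap_000000=AAAA; incap_ses_111_000000=CCCC
    }
}
```

Returning an empty Optional when a cookie is missing avoids sending a header such as "null null" to the server.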

getIncapsulaCookie(...) retrieves the Incapsula cookies for you. Here is a modified version of your method that uses the cookie:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public static String readUrl(String url, String incapsulaCookie) {

    StringBuilder response = new StringBuilder();
    String USER_AGENT = "Mozilla/5.0", inputLine;

    try {

        HttpURLConnection connection =
                (HttpURLConnection) new URL(url).openConnection();
        connection.setRequestMethod("GET");
        connection.setRequestProperty("Accept", "text/html; charset=UTF-8");
        connection.setRequestProperty("User-Agent", USER_AGENT);

        // Set the Incapsula cookie
        connection.setRequestProperty("Cookie", incapsulaCookie);

        // try-with-resources closes the reader even if reading fails.
        // Assumption: the page is encoded as UTF-8.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(),
                        StandardCharsets.UTF_8))) {

            while ((inputLine = in.readLine()) != null) {
                // Keep the line breaks so adjacent lines do not run together
                response.append(inputLine).append('\n');
            }
        }

    } catch (Exception e) {
        e.printStackTrace();
    }
    return response.toString();

}

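As far as I observed, the user agent and the other properties do not seem to matter. You can now call getIncapsulaCookie(String url) once, or whenever you need a new cookie, and then call readUrl(String url, String incapsulaCookie) as many times as you like to request the page until the cookie expires. The result is the complete HTML page, as shown in this partial screenshot:

[image: part of the complete HTML page]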
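The "fetch the cookie once, reuse it, and refresh it when the error page comes back" pattern can be sketched as a small wrapper. This is an illustration under assumptions, not part of the original code: the class CookieRefreshingReader and its method names are hypothetical, and the error-page markers are simply strings taken from the two blocked responses quoted in the question, so they may not cover every variant Incapsula returns. The two functions passed to the constructor stand in for getIncapsulaCookie and readUrl.

```java
import java.util.function.BiFunction;
import java.util.function.Function;

public class CookieRefreshingReader {

    private final Function<String, String> cookieFetcher;        // url -> cookie
    private final BiFunction<String, String, String> pageReader; // (url, cookie) -> html
    private String cookie; // cached cookie, fetched lazily

    public CookieRefreshingReader(Function<String, String> cookieFetcher,
                                  BiFunction<String, String, String> pageReader) {
        this.cookieFetcher = cookieFetcher;
        this.pageReader = pageReader;
    }

    // Heuristic (assumption): markers copied from the blocked responses
    // shown in the question.
    public static boolean looksLikeErrorPage(String html) {
        return html == null
                || html.contains("Incapsula incident ID")
                || html.contains("content.incapsula.com/jsTest.html");
    }

    // Reads the page, fetching a fresh cookie and retrying once if the
    // cached cookie appears to have expired.
    public String read(String url) {
        if (cookie == null) {
            cookie = cookieFetcher.apply(url);
        }
        String html = pageReader.apply(url, cookie);
        if (looksLikeErrorPage(html)) {
            cookie = cookieFetcher.apply(url);
            html = pageReader.apply(url, cookie);
        }
        return html;
    }

    public static void main(String[] args) {
        // Offline stand-ins: the first cookie is treated as expired.
        int[] fetchCount = {0};
        CookieRefreshingReader reader = new CookieRefreshingReader(
                url -> "cookie-" + (++fetchCount[0]),
                (url, c) -> c.equals("cookie-1")
                        ? "Request unsuccessful. Incapsula incident ID: 0"
                        : "<html>real page</html>");
        System.out.println(reader.read("https://www.example.com/"));
        System.out.println("cookie fetches: " + fetchCount[0]);
    }
}
```

With the methods from this answer in the same class, the wrapper could be created as new CookieRefreshingReader(url -> getIncapsulaCookie(url), (url, c) -> readUrl(url, c)).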

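Important detail: The getIncapsulaCookie(...) method contains two essential statements, cookieConnection.setRequestProperty("Connection", "close"); and cookieConnection.disconnect();. Both are required if you want to call readUrl(...) immediately afterwards. If you omit them, the HTTP connection stays alive on the server side after you receive the cookies, and the next call to readUrl(...) returns the error page. You can verify this by omitting these statements, calling getIncapsulaCookie(...), waiting 5 to 65 seconds, and then calling readUrl(...). You will see that this also works, because the connection times out on its own. See also here.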