How to scrape HTML content from an (internal) HTTPS page

Time: 2012-04-30 16:42:08

Tags: java html web-scraping

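I'm trying to download the source of a page on our intranet. I can open the page in every browser without explicitly logging in.

When I try to fetch the page content with the code below, it fails with the error shown after the code: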

public void scrape() throws IOException {

    String httpsURL = "https://myurl.aspx";
    URL myurl = new URL(httpsURL);
    HttpsURLConnection con = (HttpsURLConnection)myurl.openConnection();
    InputStream ins = con.getInputStream();  //breaks here
    InputStreamReader isr = new InputStreamReader(ins);
    BufferedReader in = new BufferedReader(isr);

    String inputLine;

    while ((inputLine = in.readLine()) != null)
    {
        System.out.println(inputLine);
    }

    in.close();

}

Error: Exception in thread "main" java.io.IOException: Server returned HTTP response code: 500 for URL: https://myurl.aspx

It breaks specifically on the line -> InputStream ins = con.getInputStream();

I'm not sure how to fix this?

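1 Answer:

Answer 0: (score: 1)

The first thing to do, as nsfyn55 suggested in his/her comment, is to inspect the headers using a browser. Some sites check the User-Agent HTTP header before returning a response. The second thing is that, when using HTTPS, you need to initialize the security layer properly. Have a look at this class: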

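import java.security.KeyManagementException;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSession;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

public class SSLConfiguration {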

    private static boolean isSslInitialized = false;
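    // "TLS" is the standard context name; the legacy "SSL" name is best avoided.
    private static final String PROTOCOL = "TLS";
    // WARNING: true disables certificate and hostname validation; use it only for testing.
    public static boolean ACCEPT_ALL_CERTS = true;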

    public static void initializeSSLConnection() {
        if (!isSslInitialized) {
            if (ACCEPT_ALL_CERTS) {
                initInsecure();
            } else {
                initSsl();
            }
        }
    }

    private static void initInsecure() {
        TrustManager[] trustAllCerts = new TrustManager[]{
            new X509TrustManager() {

                @Override
                public java.security.cert.X509Certificate[] getAcceptedIssuers() {
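                    // The interface contract asks for a non-null (possibly empty) array.
                    return new java.security.cert.X509Certificate[0];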
                }

                @Override
                public void checkClientTrusted(
                        java.security.cert.X509Certificate[] certs, String authType) {
                }

                @Override
                public void checkServerTrusted(
                        java.security.cert.X509Certificate[] certs, String authType) {
                }
            }
        };

        // Install the all-trusting trust manager
        try {
            SSLContext sc = SSLContext.getInstance(PROTOCOL);
            sc.init(null, trustAllCerts, new java.security.SecureRandom());
            HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());
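        } catch (Exception e) {
            // Do not swallow the failure silently; surface it to the caller.
            throw new RuntimeException("Could not install the all-trusting SSL context", e);
        }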
        HttpsURLConnection.setDefaultHostnameVerifier(
                new HostnameVerifier() {

                    @Override
                    public boolean verify(String string, SSLSession ssls) {
                        return true;
                    }
                });
        isSslInitialized = true;
    }

    private static void initSsl() {
        SSLContext sc = null;
        try {
            sc = SSLContext.getInstance(PROTOCOL);
        } catch (NoSuchAlgorithmException ex) {
            throw new RuntimeException(ex);
        }
        try {
            sc.init(null, null, new SecureRandom());
        } catch (KeyManagementException ex) {
            throw new RuntimeException(ex);
        }
        HttpsURLConnection.setDefaultSSLSocketFactory(sc.getSocketFactory());
        HostnameVerifier hv = new HostnameVerifier() {

            @Override
            public boolean verify(String urlHostName, SSLSession session) {
                /* This is to avoid spoofing */
                return (urlHostName.equals(session.getPeerHost()));
            }
        };

        HttpsURLConnection.setDefaultHostnameVerifier(hv);
        isSslInitialized = true;
    }
}

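Without this, the connection is likely to fail, especially if the site does not have a valid certificate. In your code, in the constructor of your class, insert the following: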

SSLConfiguration.initializeSSLConnection();

A few more things to consider: after openConnection, I suggest you add the following:

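con.setRequestMethod(METHOD);  // METHOD is a String such as "GET" or "POST"
con.setDoInput(true);          // we intend to read the response
con.setDoOutput(true);         // only needed when sending a request body (e.g. POST)
con.setUseCaches(false);       // bypass any cached copy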

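But I tend to believe that, since you are getting a response from the remote server, this is more a matter of specifying the right headers, especially User-Agent and Accept. If the above doesn't solve the problem, print the stack trace of the error and read the error stream (from the remote side) to get a more meaningful error message. If you use Firefox, Live HTTP Headers is a very handy add-on. cURL is also one of the most powerful command-line tools for working with HTTP requests.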
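
As a rough illustration of the two suggestions above (sending browser-like headers and reading the error stream on a 500), here is a minimal sketch. The class name PageFetcher, the helper names readAll and fetch, the header values, and the UTF-8 charset are all assumptions made for the example; copy the real header values from what your browser sends.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class PageFetcher {

    /** Reads a stream to the end and returns its contents, one '\n' per line. */
    public static String readAll(InputStream stream) throws IOException {
        if (stream == null) {
            return ""; // getErrorStream() returns null when the server sent no body
        }
        StringBuilder body = new StringBuilder();
        // UTF-8 is an assumption; use the charset the server actually declares.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        }
        return body.toString();
    }

    /** Sends a GET with browser-like headers and returns the body, even on HTTP errors. */
    public static String fetch(String address) throws IOException {
        HttpURLConnection con =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        con.setRequestMethod("GET");
        // Hypothetical header values, chosen only for illustration.
        con.setRequestProperty("User-Agent", "Mozilla/5.0");
        con.setRequestProperty("Accept", "text/html,application/xhtml+xml");

        int status = con.getResponseCode();
        // On 4xx/5xx getInputStream() throws, so read the error stream instead.
        InputStream stream = (status >= 400) ? con.getErrorStream() : con.getInputStream();
        if (status >= 400) {
            System.err.println("HTTP " + status + " from " + address);
        }
        return readAll(stream);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageFetcher <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```

The body of a 500 response from an ASP.NET page often contains the server-side error description, which is usually more informative than the IOException message alone.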