How do I handle non-UTF-8 HTML pages in Java?

Time: 2012-01-05 17:30:35

Tags: java html string encoding httpurlconnection

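My task is to retrieve an HTML string from a URL using Java.

I know how to get the string using HttpURLConnection and InputStream.

However, I have encoding problems with some pages.

If a page uses a different encoding (for example, GB2312) rather than UTF-8, the string I get is just arbitrary characters or question marks.

Can anyone tell me how to solve this?

Thanks

Here is my code for downloading the HTML from a URL.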

private String downloadHtml(String urlString) {
    URL url = null;
    InputStream inStr = null;
    StringBuffer buffer = new StringBuffer();

    try {
        url = new URL(urlString);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // Cast shouldn't fail
        conn.setInstanceFollowRedirects(true); // per-connection setting; the static setFollowRedirects changes the global default
        // allow both GZip and Deflate (ZLib) encodings
        //conn.setRequestProperty("Accept-Encoding", "gzip, deflate"); 
        String encoding = conn.getContentEncoding();
        inStr = null;

        // create the appropriate stream wrapper based on
        // the encoding type
        if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
            inStr = new GZIPInputStream(conn.getInputStream());
        } else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
            inStr = new InflaterInputStream(conn.getInputStream(),
              new Inflater(true));
        } else {
            inStr = conn.getInputStream();
        }
        int ptr = 0;


        InputStreamReader inStrReader = new InputStreamReader(inStr, Charset.forName("GB2312"));

        while ((ptr = inStrReader.read()) != -1) {
            buffer.append((char)ptr);
        }
        inStrReader.close();

        conn.disconnect();
    }
    catch(Exception e) {

        e.printStackTrace();
    }
    finally {
        if (inStr != null)
            try {
                inStr.close();
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
    }

    return buffer.toString();
}

2 answers:

Answer 0 (score: 3)

Use an InputStreamReader and specify your charset, like this:

Reader reader = new InputStreamReader(inputStream, Charset.forName("GB2312"));

The following code works for me:

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.zip.GZIPInputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

public class Foo {

public static void main(String[] args) {
    System.out.println(downloadHtml("http://baike.baidu.com/view/6000001.htm"));
}


private static String downloadHtml(String urlString) {
    URL url = null;
    InputStream inStr = null;
    StringBuffer buffer = new StringBuffer();

    try {
        url = new URL(urlString);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // Cast shouldn't fail
        conn.setInstanceFollowRedirects(true); // per-connection setting; the static setFollowRedirects changes the global default
        // allow both GZip and Deflate (ZLib) encodings
        //conn.setRequestProperty("Accept-Encoding", "gzip, deflate"); 
        String encoding = conn.getContentEncoding();
        inStr = null;

        // create the appropriate stream wrapper based on
        // the encoding type
        if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
            inStr = new GZIPInputStream(conn.getInputStream());
        } else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
            inStr = new InflaterInputStream(conn.getInputStream(),
              new Inflater(true));
        } else {
            inStr = conn.getInputStream();
        }
        int ptr = 0;


        InputStreamReader inStrReader = new InputStreamReader(inStr, Charset.forName("GB2312"));

        while ((ptr = inStrReader.read()) != -1) {
            buffer.append((char)ptr);
        }
        inStrReader.close();

        conn.disconnect();
    }
    catch(Exception e) {

        e.printStackTrace();
    }
    finally {
        if (inStr != null)
            try {
                inStr.close();
            } catch (IOException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
    }

    return buffer.toString();
  }

}
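
Hardcoding GB2312 only helps when the page is known to use that encoding. One possible refinement, sketched here and not part of the original answer, is to read the charset from the Content-Type response header and fall back to a default when the header does not name one. The class name CharsetSniffer, the helper charsetFromContentType, and the UTF-8 fallback are all illustrative assumptions. Pages that declare their encoding only in an HTML meta tag are not handled by this sketch.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class CharsetSniffer {

    // Matches the charset parameter of a Content-Type value,
    // e.g. "text/html; charset=GB2312" (optionally quoted).
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header value, or the fallback when the
    // header is missing, has no charset parameter, or names an unknown charset.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers IllegalCharsetNameException and UnsupportedCharsetException.
            return fallback;
        }
    }

    // Downloads a page and decodes it with the charset the server declares.
    // Assumption: UTF-8 is an acceptable default when nothing is declared.
    static String downloadHtml(String urlString) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        try {
            Charset cs = charsetFromContentType(conn.getContentType(), StandardCharsets.UTF_8);
            StringBuilder sb = new StringBuilder();
            try (Reader reader = new InputStreamReader(conn.getInputStream(), cs)) {
                char[] buf = new char[4096];
                int n;
                while ((n = reader.read(buf)) != -1) {
                    sb.append(buf, 0, n);
                }
            }
            return sb.toString();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Header parsing only; no network access is needed for these two calls.
        System.out.println(charsetFromContentType("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        System.out.println(charsetFromContentType("text/html", StandardCharsets.UTF_8));
    }
}
```

With this in place, downloadHtml("http://baike.baidu.com/view/6000001.htm") would decode using whatever charset the server reports, if it reports one.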

Answer 1 (score: 1)

Read the InputStream with an InputStreamReader, using the constructor InputStreamReader(InputStream in, Charset cs).
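
To illustrate why the charset argument matters, here is a small in-memory sketch (no network involved). It encodes a two-character Chinese string as GB2312 bytes, then decodes those bytes once with GB2312 and once with UTF-8. The class name DecodeDemo and the helper decode are made up for this example, and it assumes the running JDK ships the GB2312 charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class DecodeDemo {

    // Decodes the bytes through InputStreamReader(InputStream, Charset).
    static String decode(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: GB2312 is available in this JDK.
        Charset gb2312 = Charset.forName("GB2312");
        String original = "\u4e2d\u6587"; // two Chinese characters, written as escapes
        byte[] bytes = original.getBytes(gb2312);

        String right = decode(bytes, gb2312);                 // same charset as the bytes
        String wrong = decode(bytes, StandardCharsets.UTF_8); // mismatched charset

        System.out.println("GB2312 round trip matches: " + right.equals(original));
        System.out.println("UTF-8 decode matches:      " + wrong.equals(original));
    }
}
```

The mismatched decode is the situation described in the question: the bytes are valid GB2312 but not valid UTF-8, so the reader substitutes replacement characters.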