HttpClient garbles extended characters

Date: 2014-08-19 18:01:51

Tags: utf-8 httpclient apache-httpclient-4.x

I am using HttpClient to retrieve remote URLs and need to extract things such as the page title.

In some cases the extended characters come back garbled, for example with this URL:

http://olhardigital.uol.com.br/noticia/bilionaria-mais-jovem-da-historia-quer-revolucionar-exames-de-sangue/43586

I have tried all kinds of settings, to no avail. Any suggestions? My configuration is as follows:

private CloseableHttpClient httpclient = RemotePageUtils.getThreadSafeClient();

public String processMethod(String url, OutputStream out) throws IOException, IllegalArgumentException{

    [...]

    BufferedReader in = null;
    HttpEntity entity = null;
    HttpGet httpget = null;

    CloseableHttpResponse resp = null;

    try {

        httpget = new HttpGet(url);

        resp = httpclient.execute(httpget);

        entity = resp.getEntity();

        String inLine;

        in = new BufferedReader(new InputStreamReader(entity.getContent(),"UTF-8"));

        while ((inLine = in.readLine()) != null) {

            out.write(inLine.getBytes("UTF-8"));
        }

    } finally {

        [...]

    }
    return null;
}

private static CloseableHttpClient getThreadSafeClient() {

    SocketConfig socketConfig = SocketConfig.custom()
            .setTcpNoDelay(true)
            .build();

    RequestConfig config = RequestConfig.custom()
            .setConnectTimeout(3000)
            .setSocketTimeout(7000)
            .setStaleConnectionCheckEnabled(false)
            .build();

    List<Header> headers = new ArrayList<Header>();
    headers.add(new BasicHeader("Accept-Charset","ISO-8859-1,US-ASCII,UTF-8,UTF-16;q=0.7,*;q=0.7"));
    //accept gzipped
    headers.add(new BasicHeader("Accept-Encoding","gzip,x-gzip,deflate,sdch"));


    CloseableHttpClient client = HttpClientBuilder.create()
            .setDefaultHeaders(headers)
            .setDefaultRequestConfig(config)
            .setDefaultSocketConfig(socketConfig)
            .build();

    return client;

}

1 answer:

Answer 0 (score: 1)

You are blindly interpreting every downloaded page as UTF-8, but the example link you gave is not UTF-8; it is ISO-8859-1.

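In ISO-8859-1, an accented letter is a single byte with a value >= 128. In UTF-8, bytes >= 128 may only appear as part of multi-byte sequences that follow a specific pattern; anything else is treated as malformed.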
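To make the mismatch concrete, here is a minimal sketch using only the JDK. The class and method names are made up for illustration. It takes the single ISO-8859-1 byte for "é" (0xE9), decodes it as UTF-8 the way the question's loop does, and then re-encodes it. Because a lone 0xE9 is not a valid UTF-8 sequence, the decoder substitutes the replacement character U+FFFD, so the bytes written out no longer match the bytes that were downloaded.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetMismatchDemo {

    // Decodes the bytes as UTF-8 and re-encodes them as UTF-8, mirroring
    // the InputStreamReader(..., "UTF-8") / getBytes("UTF-8") round trip.
    static byte[] roundTripAsUtf8(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is the single byte 0xE9 in ISO-8859-1.
        byte[] latin1 = "\u00E9".getBytes(StandardCharsets.ISO_8859_1);

        // Decoded with the wrong charset: the lone byte is malformed UTF-8.
        String wrong = new String(latin1, StandardCharsets.UTF_8);
        System.out.println("Replaced with U+FFFD: " + wrong.equals("\uFFFD"));

        // Decoded with the right charset: the character survives.
        String right = new String(latin1, StandardCharsets.ISO_8859_1);
        System.out.println("Decoded correctly:    " + right.equals("\u00E9"));

        // The round trip does not give back the original bytes.
        byte[] after = roundTripAsUtf8(latin1);
        System.out.println("Bytes preserved:      " + Arrays.equals(latin1, after));
    }
}
```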

But why decode the downloaded bytes at all, only to write bytes back out to a file?

Instead of this:

 in = new BufferedReader(new InputStreamReader(entity.getContent(),"UTF-8"));
 while ((inLine = in.readLine()) != null) {
     out.write(inLine.getBytes("UTF-8"));
 }

which converts the bytes to a string and back again, you should simply copy the bytes.

You can use Apache Commons IO:

import org.apache.commons.io.IOUtils;

IOUtils.copy(entity.getContent(), out);

Or do it manually, in plain Java:

byte[] buf = new byte[16 * 1024];
int len = 0;
InputStream in = entity.getContent();
while ((len = in.read(buf)) >= 0) {
    out.write(buf, 0, len);
}
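
If the text does have to be decoded, for instance to pull out the title as the question mentions, the charset should come from the response rather than being hard-coded. The following is only a sketch under stated assumptions. `charsetOf` is a hypothetical helper, and the header value `text/html; charset=ISO-8859-1` is an assumed example, not something confirmed for the linked page. With HttpClient 4.x the real value could be read from `entity.getContentType()`, which may be null. Pages can also declare their encoding only in a `<meta>` tag, which this sketch does not handle.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    // Hypothetical helper: returns the charset named in a Content-Type header
    // value, or the fallback if the parameter is missing or not supported.
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // Assumed header value, for illustration only.
        String header = "text/html; charset=ISO-8859-1";
        Charset cs = charsetOf(header, StandardCharsets.UTF_8);

        // Sample body bytes as an ISO-8859-1 server would send them.
        byte[] body = "Bilion\u00E1ria".getBytes(StandardCharsets.ISO_8859_1);
        String text = new String(body, cs);
        System.out.println(cs.name() + ": " + text);
    }
}
```

The fallback is a parameter on purpose: which default is reasonable when the header carries no charset depends on the sites being fetched.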