Question

I am using HttpClient to retrieve remote URLs and need to extract things such as the page title.

In some cases I get garbled extended characters, as with this URL:

http://olhardigital.uol.com.br/noticia/bilionaria-mais-jovem-da-historia-quer-revolucionar-exames-de-sangue/43586

I have tried all sorts of settings, to no avail. Any suggestions? My configuration is as follows:

private CloseableHttpClient httpclient = RemotePageUtils.getThreadSafeClient();

public String processMethod(String url, OutputStream out) throws IOException, IllegalArgumentException{

    [...]

    BufferedReader in = null;
    HttpEntity entity = null;
    HttpGet httpget = null;

    CloseableHttpResponse resp = null;

    try {

        httpget = new HttpGet(url);

        resp = httpclient.execute(httpget);

        entity = resp.getEntity();

        String inLine;

        in = new BufferedReader(new InputStreamReader(entity.getContent(),"UTF-8"));

        while ((inLine = in.readLine()) != null) {

            out.write(inLine.getBytes("UTF-8"));
        }

    } finally {

        [...]

    }
    return null;
}

private static CloseableHttpClient getThreadSafeClient() {

    SocketConfig socketConfig = SocketConfig.custom()
            .setTcpNoDelay(true)
            .build();

    RequestConfig config = RequestConfig.custom()
            .setConnectTimeout(3000)
            .setSocketTimeout(7000)
            .setStaleConnectionCheckEnabled(false)
            .build();

    List<Header> headers = new ArrayList<Header>();
    headers.add(new BasicHeader("Accept-Charset","ISO-8859-1,US-ASCII,UTF-8,UTF-16;q=0.7,*;q=0.7"));
    //accept gzipped
    headers.add(new BasicHeader("Accept-Encoding","gzip,x-gzip,deflate,sdch"));


    CloseableHttpClient client = HttpClientBuilder.create()
            .setDefaultHeaders(headers)
            .setDefaultRequestConfig(config)
            .setDefaultSocketConfig(socketConfig)
            .build();

    return client;

}

Answer 1

You are blindly interpreting every downloaded page as UTF-8, but the example link you gave is not UTF-8; it is ISO-8859-1.

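An accented letter in ISO-8859-1 is a single byte with a value >= 128, whereas in UTF-8 bytes in that range must follow specific multi-byte patterns; otherwise they are treated as malformed.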
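To make the mismatch concrete, here is a minimal sketch in plain Java. The class and method names (`CharsetMismatchDemo`, `decode`) are made up for illustration and are not part of HttpClient. It encodes "café" as ISO-8859-1 and then decodes those bytes with both charsets:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Hypothetical helper: decode raw bytes with the given charset.
    // new String(byte[], Charset) replaces malformed input with U+FFFD.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "café" in ISO-8859-1: the accented letter is the single byte 0xE9.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        // Decoding with the correct charset restores the text.
        String right = decode(latin1, StandardCharsets.ISO_8859_1);

        // Decoding as UTF-8 fails on 0xE9: it announces a multi-byte
        // sequence that never arrives, so it becomes U+FFFD.
        String wrong = decode(latin1, StandardCharsets.UTF_8);

        System.out.println(right.equals("caf\u00e9")); // true
        System.out.println(wrong.equals("caf\uFFFD")); // true
    }
}
```

Once the replacement character is in the String, the original byte is gone, so writing the String back out as UTF-8 cannot recover it.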

But why decode the downloaded bytes at all, only to write them back out as bytes?

Instead of:

 in = new BufferedReader(new InputStreamReader(entity.getContent(),"UTF-8"));
 while ((inLine = in.readLine()) != null) {
     out.write(inLine.getBytes("UTF-8"));
 }

which converts the bytes to a String and back again, you should just copy the bytes.

You can use Apache Commons IO:

import org.apache.commons.io.IOUtils;

IOUtils.copy(entity.getContent(), out);

or do it by hand in plain Java:

byte[] buf = new byte[16 * 1024];
int len = 0;
InputStream in = entity.getContent();
while ((len = in.read(buf)) >= 0) {
    out.write(buf, 0, len);
}
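If you do need a String at some point (for example, to pull out the title mentioned in the question), the bytes have to be decoded with the charset the server actually declares rather than a hard-coded one. The following is only a sketch: `ContentTypeCharset` and `charsetFromContentType` are hypothetical names, the header value is passed in as a plain String, and it assumes the charset appears in the `Content-Type` header, which not every site does (some declare it only in an HTML `<meta>` tag).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {

    // Hypothetical helper: extract the charset parameter from a
    // Content-Type header value such as "text/html; charset=ISO-8859-1".
    // Returns the fallback when no usable charset is declared.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // The parameter value may be quoted: charset="utf-8"
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] body = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        Charset cs = charsetFromContentType(
                "text/html; charset=ISO-8859-1", StandardCharsets.UTF_8);
        // Decode with the declared charset instead of a fixed "UTF-8".
        System.out.println(new String(body, cs).equals("caf\u00e9")); // true
    }
}
```

With HttpClient itself, the header value would come from `entity.getContentType()`. HttpClient 4.x also ships its own `ContentType` and `EntityUtils` helpers for this purpose, which are probably preferable to hand-rolled parsing if they are available in your version.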
