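I am using HttpClient to retrieve remote URLs and need to extract things such as the page title.
In some cases the extended characters come back garbled, as with this URL.
I have tried all sorts of settings, to no avail. Any suggestions? My configuration is as follows: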
private CloseableHttpClient httpclient = RemotePageUtils.getThreadSafeClient();

public String processMethod(String url, OutputStream out) throws IOException, IllegalArgumentException {
    [...]
    BufferedReader in = null;
    HttpEntity entity = null;
    HttpGet httpget = null;
    CloseableHttpResponse resp = null;
    try {
        httpget = new HttpGet(url);
        resp = httpclient.execute(httpget);
        entity = resp.getEntity();
        String inLine;
        in = new BufferedReader(new InputStreamReader(entity.getContent(), "UTF-8"));
        while ((inLine = in.readLine()) != null) {
            out.write(inLine.getBytes("UTF-8"));
        }
    } finally {
        [...]
    }
    return null;
}
private static CloseableHttpClient getThreadSafeClient() {
    SocketConfig socketConfig = SocketConfig.custom()
        .setTcpNoDelay(true)
        .build();
    RequestConfig config = RequestConfig.custom()
        .setConnectTimeout(3000)
        .setSocketTimeout(7000)
        .setStaleConnectionCheckEnabled(false)
        .build();
    List<Header> headers = new ArrayList<Header>();
    headers.add(new BasicHeader("Accept-Charset", "ISO-8859-1,US-ASCII,UTF-8,UTF-16;q=0.7,*;q=0.7"));
    // accept gzipped
    headers.add(new BasicHeader("Accept-Encoding", "gzip,x-gzip,deflate,sdch"));
    CloseableHttpClient client = HttpClientBuilder.create()
        .setDefaultHeaders(headers)
        .setDefaultRequestConfig(config)
        .setDefaultSocketConfig(socketConfig)
        .build();
    return client;
}
Answer 0 (score: 1)
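You are blindly interpreting every downloaded page as UTF-8, but the example link you gave is not UTF-8; it is ISO-8859-1.
In ISO-8859-1 an accented letter is a single byte with a value >= 128, whereas in UTF-8 bytes in that range must follow specific multi-byte patterns; otherwise they are treated as malformed.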
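To make the mismatch concrete, here is a minimal sketch in plain Java. The class name CharsetMismatchDemo, the helper decodeAsUtf8, and the sample text are made up for illustration; they are not part of the question's code. It takes the ISO-8859-1 bytes for a word with one accented letter and decodes them once as UTF-8 (as the question's InputStreamReader does) and once as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {

    // Hypothetical helper: decodes raw bytes as UTF-8, like the
    // InputStreamReader(..., "UTF-8") in the question.
    static String decodeAsUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The accented letter U+00E9 is the single byte 0xE9 in ISO-8859-1.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);

        String wrong = decodeAsUtf8(latin1);
        String right = new String(latin1, StandardCharsets.ISO_8859_1);

        // 0xE9 on its own is not valid UTF-8, so the decoder substitutes
        // the replacement character U+FFFD and the original letter is lost.
        System.out.println("UTF-8 decode contains U+FFFD: " + wrong.contains("\uFFFD"));
        System.out.println("ISO-8859-1 decode is intact:  " + right.equals("caf\u00e9"));
    }
}
```

Once the replacement character is in the String, re-encoding it with getBytes("UTF-8") cannot bring the original byte back, which is why the output file ends up garbled.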
But why decode the downloaded bytes at all, only to write bytes out to a file again?
Instead of:
    in = new BufferedReader(new InputStreamReader(entity.getContent(), "UTF-8"));
    while ((inLine = in.readLine()) != null) {
        out.write(inLine.getBytes("UTF-8"));
    }
which converts the bytes to a String and back again, you should simply copy the bytes.
You can use Apache Commons IO:
import org.apache.commons.io.IOUtils;
IOUtils.copy(entity.getContent(), out);
Or do it manually, in plain Java:
    byte[] buf = new byte[16 * 1024];
    int len = 0;
    InputStream in = entity.getContent();
    while ((len = in.read(buf)) >= 0) {
        out.write(buf, 0, len);
    }
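As a self-contained check of the byte-copy approach, the sketch below runs the same loop against an in-memory stream instead of a real HTTP response. The class ByteCopyDemo, its copy and roundTrip helpers, and the sample body are hypothetical stand-ins; in the real code the input would be entity.getContent().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ByteCopyDemo {

    // Hypothetical helper: copies every byte from in to out without decoding.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[16 * 1024];
        int len;
        while ((len = in.read(buf)) >= 0) {
            out.write(buf, 0, len);
        }
    }

    // Hypothetical helper: pushes a byte array through copy() and returns the result.
    static byte[] roundTrip(byte[] body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            copy(new ByteArrayInputStream(body), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Sample page body encoded as ISO-8859-1, with an accented letter and line breaks.
        byte[] body = "<title>caf\u00e9</title>\n<p>ok</p>\n"
                .getBytes(StandardCharsets.ISO_8859_1);

        // The copy is byte-for-byte identical, whatever the page's charset is.
        System.out.println(Arrays.equals(body, roundTrip(body)));
    }
}
```

Besides keeping the original encoding, the byte copy also keeps the line breaks, which the readLine() loop in the question silently drops because readLine() returns each line without its terminator.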