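I'm trying to load several websites whose content is in different languages. Only for the Russian-language site do I see <?> symbols in place of the text.
Please help me decode it into the correct characters. My code sample: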
RequestTask t = new RequestTask();
response = t.doIt("http://google.ru"); //troubles
//response = t.doIt("http://stackoverflow.com"); //ok
//response = t.doIt("http://web.de/"); //ok
//response = t.doIt("http://www.china.com/"); // omg, it's ok too!
StatusLine statusLine = response.getStatusLine();
if (statusLine.getStatusCode() == HttpStatus.SC_OK) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    response.getEntity().writeTo(out);
    out.close();
    // The body is always decoded as UTF-8, whatever charset the server actually used
    String response_string = new String(out.toByteArray(), "UTF-8");
}
The request code:
public class RequestTask {
    public HttpResponse doIt(String... uri)
            throws ConnectTimeoutException, UnknownHostException, IOException {
        HttpParams params = new BasicHttpParams();
        HttpConnectionParams.setConnectionTimeout(params, 6000);
        HttpConnectionParams.setSoTimeout(params, 6000);
        HttpClient httpclient = new DefaultHttpClient(params);
        HttpResponse response = null;
        Log.d(this.toString(), "HTTP GET to " + uri[0]);
        response = httpclient.execute(new HttpGet(uri[0]));
        Log.d(this.toString(), "response: " + response.getStatusLine().getReasonPhrase());
        return response;
    }
}
Answer 0 (score: 0)
I don't think there is anything wrong with google.ru itself. The page is simply not served as UTF-8:
$ wget google.ru
[...skipped....]
$ enca -L ru index.html
MS-Windows code page 1251
LF line terminators
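Always keep in mind that at least three other encodings are in more or less common use on pages with Russian content. Besides "UTF-8", I would definitely check "KOI8-R", "Windows-1251" and (less popular) "Mac Cyrillic".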
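To see why the hard-coded "UTF-8" produces unreadable output, here is a minimal Python 3 sketch. The sample word is an assumption chosen for illustration, not content taken from google.ru. It encodes Cyrillic text as Windows-1251 and then decodes the same bytes three ways. Java's `new String(bytes, "UTF-8")` similarly substitutes a replacement character for malformed input, which is most likely what shows up as <?> in the question.

```python
# Illustrative only: the sample word is an assumption, not data from the site.
text = "Привет"
raw = text.encode("cp1251")  # bytes as a Windows-1251 page would deliver them

# Wrong charset: invalid UTF-8 sequences become U+FFFD replacement characters.
as_utf8 = raw.decode("utf-8", errors="replace")
# Wrong charset: KOI8-R accepts the bytes but maps them to different letters.
as_koi8 = raw.decode("koi8-r")
# Right charset: the original text comes back intact.
as_cp1251 = raw.decode("cp1251")

print(repr(as_utf8))
print(as_koi8)
print(as_cp1251)
```

Note that the KOI8-R decode does not raise an error; it silently returns the wrong letters, so "decoded without an exception" is not proof that the charset was right.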
You would probably be better off using something like this:
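# Python 3. Assumes `data` holds the raw page bytes and `name` is a label for the report.
# Strict UTF-8 goes first: single-byte code pages accept almost any byte sequence,
# so whichever of them is tried first would otherwise always "win".
encodings = ("utf-8", "cp1251", "koi8-r")  # maybe some others...
result = ""
for enc in encodings:
    try:
        result = data.decode(enc)
        break
    except UnicodeDecodeError:
        continue
if result:
    print(name + "\t: " + enc)
else:
    print(name + "\t: unable to determine the encoding")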
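Guessing is only needed when the server does not say which charset it used. As a hedged sketch, assuming the response carries a `Content-Type` header with a `charset` parameter, the declared value can be tried first and the trial list used as a fallback. The helper names `charset_from_content_type` and `decode_page` are hypothetical, as is the fallback order.

```python
import re

# Hypothetical fallback order; strict UTF-8 first, then the single-byte code pages.
FALLBACKS = ("utf-8", "cp1251", "koi8-r")


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    match = re.search(r"charset\s*=\s*[\"']?([\w.:-]+)", content_type or "", re.I)
    return match.group(1) if match else None


def decode_page(raw, content_type=None):
    """Decode raw bytes, preferring the declared charset over guessing.

    Returns a (text, encoding_used) pair; encoding_used is None if every
    candidate failed and replacement characters had to be inserted.
    """
    declared = charset_from_content_type(content_type)
    candidates = ((declared,) if declared else ()) + FALLBACKS
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset name: try the next candidate
    return raw.decode("utf-8", errors="replace"), None


# Usage with made-up sample data:
raw = "Привет".encode("cp1251")
text, used = decode_page(raw, "text/html; charset=windows-1251")
print(used, text)
```

The same idea carries over to the Java code in the question: read the charset from the response's Content-Type before constructing the String, instead of passing a fixed "UTF-8".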