
Time: 2015-11-25 16:17:32

Tags: java string unicode decoding


Charset charset = Charset.forName( "UTF-8" );
CharsetDecoder decoder = charset.newDecoder();
FileInputStream fis = new FileInputStream( file );
FileChannel fc = fis.getChannel();
int lenFile = (int)fc.size();
MappedByteBuffer bufferFile = fc.map( FileChannel.MapMode.READ_ONLY, 0, lenFile );
CharBuffer cb = decoder.decode( bufferFile );
// process character buffer


3 answers:

Answer 0 (score: 3)



// Extrapolation...
byte stream --> decoding       --> char stream
InputStream --> CharsetDecoder --> Reader

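A little-known fact is that most (but not all... see below) default decoders in the JDK (such as the one created by a FileReader, or by an InputStreamReader that is given only a charset) have a policy of CodingErrorAction.REPLACE. The effect is that any invalid byte sequence in the input is replaced with the Unicode replacement character (yes, the infamous �).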
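
To make the replacement behavior concrete, here is a minimal sketch that decodes a deliberately invalid byte sequence through an InputStreamReader given only a charset. The class name ReplaceDemo and the helper decodeLeniently are hypothetical names chosen for this illustration; the byte 0xFF is used because it can never appear in well-formed UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class ReplaceDemo {
    // Decodes the bytes with an InputStreamReader that is given only a charset,
    // so the reader's built-in error policy applies (hypothetical helper).
    static String decodeLeniently(byte[] bytes) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[16];
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)) {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 'a', an invalid byte, then 'b'
        byte[] bytes = { 'a', (byte) 0xFF, 'b' };
        String decoded = decodeLeniently(bytes);
        // No exception is thrown; the bad byte shows up as U+FFFD instead.
        System.out.println(decoded.equals("a\uFFFDb"));
    }
}
```

The point of the sketch is that the corruption is silent: the caller gets a String back and has to look for U+FFFD to notice that anything went wrong.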


// This is 2015. File is obsolete.
final Path path = Paths.get(...);
// newDecoder() reports malformed input by default instead of replacing it
final CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();

try (
    final InputStream in = Files.newInputStream(path);
    final Reader reader = new InputStreamReader(in, decoder);
) {
    // use the reader
}

Java 8 introduced one exception to this default replace action: Files.newBufferedReader(somePath) will try to read the file as UTF-8, and its default action is REPORT.
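
As a rough illustration of the REPORT behavior, the sketch below uses the one-argument Files.newBufferedReader (Java 8 and later) to check whether a file decodes cleanly as UTF-8. The class name StrictReaderDemo, the helper isValidUtf8, and the temporary file are assumptions made for this example only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.CharacterCodingException;
import java.nio.file.Files;
import java.nio.file.Path;

class StrictReaderDemo {
    // Returns true if the whole file decodes as UTF-8, false if the decoder
    // reports a malformed or unmappable sequence (hypothetical helper).
    static boolean isValidUtf8(Path path) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(path)) {
            char[] buf = new char[1024];
            while (reader.read(buf, 0, buf.length) != -1) {
                // Content is discarded; only decoding errors matter here.
            }
            return true;
        } catch (CharacterCodingException e) {
            // With REPORT, bad input surfaces as an exception, not as U+FFFD.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("utf8-demo", ".txt");
        try {
            // 0xFF can never occur in well-formed UTF-8.
            Files.write(tmp, new byte[] { 'a', (byte) 0xFF, 'b' });
            System.out.println("valid UTF-8? " + isValidUtf8(tmp));
        } finally {
            Files.delete(tmp);
        }
    }
}
```

Catching CharacterCodingException covers both MalformedInputException and UnmappableCharacterException, which are the two ways a reporting decoder can fail.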

Answer 1 (score: 1)


Answer 2 (score: 0)

@fge, I didn't know about the REPORT option, very cool. @Tyler, I think the trick is to use BufferedReader's read() method. Excerpted from here: https://docs.oracle.com/javase/7/docs/api/java/io/BufferedReader.html#read%28char[],%20int,%20int%29

public int read(char[] cbuf,
       int off,
       int len)
         throws IOException


read #1, found 32 chars
read #2, found 32 chars
read #3, found 32 chars
read #4, found 32 chars
read #80, found 32 chars
read #81, found 32 chars
read #82, found 7 chars
Done, read total=2599 chars, readcnt=82

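Note that the output above happens to end with a final read of 7 chars; you can adjust the size of the buffer array to handle whatever "chunk" size you want. This is only an example, meant to show that you don't have to worry about getting stuck on a "middle byte" somewhere inside a multi-byte UTF-8 character.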

import java.io.*;

class Foo {
   public static void main( String args[] ) throws Exception {
      String encoding = "UTF8";
      String inFilename = "unicode-example-utf8.txt";
      // Test file from http://www.i18nguy.com/unicode/unicode-example-intro.htm
      // Specifically the Example Data, CSV format:
      //     http://www.i18nguy.com/unicode/unicode-example-utf8.zip
      char buff[] = new char[ 32 ]; // or whatever size...
      // I know the readers  can be combined to just nest the temp instances,
      // for an  example i think it is easier to parse the structure
      // with each reader explicitly declared.
      FileInputStream finstream = new FileInputStream( inFilename );
      InputStreamReader instream = new InputStreamReader( finstream, encoding );
      BufferedReader in = new BufferedReader( instream );
      int n;
      long total = 0;
      long readcnt = 0;
      while( -1 != (n = in.read( buff, 0, buff.length ) ) ) {
         total += n;
         readcnt++; // count this read so the numbering starts at 1
         System.out.println("read #"+readcnt+", found "+n+" chars ");
      }
      in.close();
      System.out.println( "Done, read total="+total+" chars, readcnt="+readcnt );
   }
}
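
The same claim can be checked without a test file. The sketch below is an assumption-laden illustration, not part of the original answer: it decodes an in-memory string containing 1-, 2-, 3- and 4-byte UTF-8 sequences while reading only a few chars at a time, and compares the result with the original text. ChunkedReadDemo and readInChunks are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class ChunkedReadDemo {
    // Decodes UTF-8 bytes by asking the reader for at most chunkSize chars
    // per call, then stitches the chunks back together (hypothetical helper).
    static String readInChunks(byte[] utf8, int chunkSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[chunkSize];
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 'a' (1 byte), U+00E9 (2 bytes), U+4E2D (3 bytes),
        // U+1F600 (4 bytes, a surrogate pair in Java), 'z' (1 byte)
        String text = "a\u00e9\u4e2d\uD83D\uDE00z";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        for (int size = 1; size <= 4; size++) {
            boolean same = readInChunks(utf8, size).equals(text);
            System.out.println("chunk size " + size + ": round trip ok = " + same);
        }
    }
}
```

One caveat: the buffer is measured in Java chars (UTF-16 code units), so a chunk boundary can still fall between the two halves of a surrogate pair. The bytes are never split, but code that inspects each chunk on its own should be prepared for a lone surrogate at the edge.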