I am trying to remove invalid UTF-8 characters with this code (found on Stack Overflow):
def text = file.text
CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder();
utf8Decoder.onMalformedInput(CodingErrorAction.IGNORE);
utf8Decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
ByteBuffer bytes = ByteBuffer.allocate(text.getBytes().length * 2)
CharBuffer cbuf = bytes.asCharBuffer()
cbuf.put(text)
cbuf.flip()
CharBuffer parsed = utf8Decoder.decode(bytes);
println parsed.toString()
The output I get looks like this:
< d o c u m e n t >
< t i t l e > S o me T i t l e < / t i t l e >
< s i t e > A S i t e < / s i t e >
Any ideas why it behaves this way?
Answer 0 (score: 1)
I don't know why that doesn't work, but this is what solved the problem for me (the code is Groovy, not Java):
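import java.nio.charset.Charset
import java.nio.charset.CharsetDecoder
import java.nio.charset.CodingErrorAction

// Decode the file's raw bytes as UTF-8, silently dropping invalid sequences.
def cleaned = file.withInputStream { stream ->
    CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder()
    utf8Decoder.onMalformedInput(CodingErrorAction.IGNORE)
    utf8Decoder.onUnmappableCharacter(CodingErrorAction.IGNORE)
    def reader = new BufferedReader(new InputStreamReader(stream, utf8Decoder))
    def sb = new StringBuilder()
    def line = null
    while ((line = reader.readLine()) != null) {
        sb.append(line).append('\n')
    }
    reader.close()
    sb.toString()  // withInputStream returns the closure's value, so the text ends up in `cleaned`
}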
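
For what it's worth, the spaced-out output in the question is probably explained by the buffer handling rather than by the decoder settings. `asCharBuffer()` stores each char as two bytes (big-endian UTF-16 by default), so an ASCII letter becomes a zero byte followed by the letter. Decoding those bytes as UTF-8 yields a NUL character before every letter, and most terminals show that as a gap. Below is a minimal sketch in plain Java that reproduces the effect and contrasts it with decoding the raw bytes directly. The class name `Utf8Cleanup` and its helper methods are hypothetical names made up for this illustration, not anything from the original post.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class Utf8Cleanup {

    // A UTF-8 decoder that silently drops malformed or unmappable input.
    static CharsetDecoder lenientDecoder() {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
    }

    // Mirrors the question's approach: write the chars through a char view
    // (two bytes per char), then decode the very same bytes as UTF-8.
    static String viaCharView(String text) throws CharacterCodingException {
        ByteBuffer bytes = ByteBuffer.allocate(text.length() * 2);
        bytes.asCharBuffer().put(text); // 'a' is stored as 0x00 0x61
        return lenientDecoder().decode(bytes).toString();
    }

    // Decodes the original bytes directly, which is what the cleanup needs.
    static String fromRawBytes(byte[] raw) throws CharacterCodingException {
        return lenientDecoder().decode(ByteBuffer.wrap(raw)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // Each letter comes back preceded by a NUL character (shown here as '.').
        System.out.println(viaCharView("ab").replace('\u0000', '.'));

        // 0xFF can never appear in valid UTF-8, so the lenient decoder skips it.
        byte[] raw = {'a', (byte) 0xFF, 'b'};
        System.out.println(fromRawBytes(raw));
    }
}
```

This would also explain why the stream-based version works: it hands the file's original bytes to the decoder instead of round-tripping them through a String and a char view. Note that `file.text` has already decoded the file once with the default charset, so by that point the invalid bytes have most likely been replaced and can no longer be filtered out.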