Today I came across a question containing the following code:
var accumulator = "";
var buffer = new byte[8192];
while (true)
{
var readed = stream.Read(buffer, 0, buffer.Length);
accumulator += Encoding.UTF8.GetString(buffer, 0, readed);
if (readed < buffer.Length)
break;
}
var result = Encoding.UTF8.GetBytes(accumulator);
I know this code is inefficient, but is it safe? Are there byte sequences that would corrupt the result?
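Answer 0 (score: 6)
The code is clearly broken; if it was posted as an answer, you should bring the bug to the author's attention.
A UTF-8 sequence can span multiple bytes. If a multi-byte sequence starts at the end of the current buffer and continues at the beginning of the next one, then converting each buffer to a string separately gives the wrong result: neither fragment is valid UTF-8 on its own, so each is decoded as a replacement character (U+FFFD) instead of the intended character.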
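To make the failure concrete, here is a minimal sketch that decodes a byte array chunk by chunk, the same way the loop in the question does. The class and method names (`SplitSequenceDemo`, `DecodePerChunk`) are made up for illustration, and the chunk size of 1 is simply a way to force a split in the middle of a two-byte sequence.

```csharp
using System;
using System.Text;

static class SplitSequenceDemo
{
    // Hypothetical helper: decodes every chunk independently with
    // Encoding.UTF8.GetString, mimicking the loop in the question.
    public static string DecodePerChunk(byte[] bytes, int chunkSize)
    {
        var sb = new StringBuilder();
        for (int offset = 0; offset < bytes.Length; offset += chunkSize)
        {
            int count = Math.Min(chunkSize, bytes.Length - offset);
            sb.Append(Encoding.UTF8.GetString(bytes, offset, count));
        }
        return sb.ToString();
    }

    static void Main()
    {
        // U+00E9 ("e" with acute accent) is the two bytes 0xC3 0xA9 in UTF-8.
        byte[] bytes = Encoding.UTF8.GetBytes("\u00E9");

        // A chunk size of 1 splits the sequence between its two bytes.
        string decoded = DecodePerChunk(bytes, 1);

        Console.WriteLine(decoded == "\u00E9");            // False
        Console.WriteLine(decoded.IndexOf('\uFFFD') >= 0); // True
    }
}
```

With a chunk size of 2 the same input decodes correctly, which is why this kind of bug tends to show up only when a character happens to straddle a buffer boundary.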
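Answer 1 (score: 1)
The safe way is to use a stateful UTF-8 decoder, which you can get from Encoding.UTF8.GetDecoder().
A stateful decoder internally holds on to the bytes that belong to an incomplete multi-byte sequence. The next time you give it more bytes, it completes the sequence and returns the characters it decoded from it.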
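The buffering behavior can be observed directly by feeding the decoder one byte at a time and recording how many characters each call produces. This is only a small sketch; `StatefulDecoderDemo` and `CharsPerByte` are illustrative names, not part of any library.

```csharp
using System;
using System.Text;

static class StatefulDecoderDemo
{
    // Hypothetical helper: feeds the bytes to a single stateful decoder one
    // at a time and returns how many chars each GetChars call produced.
    public static int[] CharsPerByte(byte[] bytes)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        // Large enough for whatever a single input byte can complete.
        char[] chars = new char[Encoding.UTF8.GetMaxCharCount(1)];
        int[] counts = new int[bytes.Length];
        for (int i = 0; i < bytes.Length; i++)
        {
            // flush: false tells the decoder to keep incomplete sequences
            // in its internal state instead of treating them as invalid.
            counts[i] = decoder.GetChars(bytes, i, 1, chars, 0, false);
        }
        return counts;
    }

    static void Main()
    {
        // U+00E9 is the two bytes 0xC3 0xA9 in UTF-8.
        byte[] bytes = Encoding.UTF8.GetBytes("\u00E9");
        Console.WriteLine(string.Join(",", CharsPerByte(bytes))); // 0,1
    }
}
```

The first call yields no characters because the decoder is still waiting for the rest of the sequence; the second call completes it and yields one.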
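Here is an example of how to use it. In my implementation, I use a char[] buffer sized so that it always has enough room to hold the full conversion of one byte buffer's worth of data. That way, only two buffers are allocated to read the entire stream (not counting whatever the StringBuilder allocates as it grows).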
public static string ReadStringFromStream( Stream stream )
{
    // --- Byte-oriented state ---
    // A nice big buffer for us to use to read from the stream.
    byte[] byteBuffer = new byte[8192];

    // --- Char-oriented state ---
    // Gets a stateful UTF8 decoder that holds onto unused bytes when multi-byte sequences
    // are split across multiple byte buffers.
    var decoder = Encoding.UTF8.GetDecoder();

    // Initialize a char buffer, and make it large enough that it will be able to fit
    // a full reads-worth of data from the byte buffer without needing to be resized.
    char[] charBuffer = new char[Encoding.UTF8.GetMaxCharCount( byteBuffer.Length )];

    // --- Output ---
    StringBuilder stringBuilder = new StringBuilder();

    // --- Working state ---
    int bytesRead;
    int charsConverted;
    bool lastRead = false;

    do
    {
        // Read a chunk of bytes from our stream.
        bytesRead = stream.Read( byteBuffer, 0, byteBuffer.Length );

        // If we read 0 bytes, we hit the end of stream.
        // We're going to tell the converter to flush, and then we're going to stop.
        lastRead = ( bytesRead == 0 );

        // Convert the bytes into characters, flushing if this is our last conversion.
        charsConverted = decoder.GetChars(
            byteBuffer,
            0,
            bytesRead,
            charBuffer,
            0,
            lastRead
        );

        // Build up a string in a character buffer.
        stringBuilder.Append( charBuffer, 0, charsConverted );
    }
    while( lastRead == false );

    return stringBuilder.ToString();
}