How do I read Unicode characters such as "ä"?
public static string Read(int length, string absolutePath)
{
    StringBuilder resultAsString = new StringBuilder();
    using (MemoryMappedFile memoryMappedFile = MemoryMappedFile.CreateFromFile(absolutePath))
    using (MemoryMappedViewStream memoryMappedViewStream = memoryMappedFile.CreateViewStream(0, length))
    {
        for (int i = 0; i < length; i++)
        {
            int result = memoryMappedViewStream.ReadByte();
            if (result == -1)
            {
                break;
            }
            char letter = (char)result;
            resultAsString.Append(letter);
        }
    }
    return resultAsString.ToString();
}
The int that gets read (result) is 195, and casting it to char does not give me the result I expect.
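For context, here is a minimal sketch of where the value 195 comes from, assuming the file is saved as UTF-8 (the question does not state the encoding). In UTF-8, "ä" (U+00E4) is stored as the two bytes 0xC3 0xA4, that is 195 and 164, so casting each byte to char yields two unrelated characters. The class and method names are hypothetical and only serve the illustration.

```csharp
using System;
using System.Text;

public static class Utf8ByteDemo
{
    // Casts every byte to char individually, as the loop in the question does.
    public static string CastEachByte(byte[] bytes)
    {
        var sb = new StringBuilder();
        foreach (byte b in bytes)
        {
            sb.Append((char)b); // treats the raw byte value as a code point
        }
        return sb.ToString();
    }

    // Decodes the whole byte sequence as UTF-8 in one step.
    public static string DecodeUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }
}
```

With the bytes of "ä", CastEachByte returns the two characters U+00C3 and U+00A4, while DecodeUtf8 returns the single character "ä".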
Answer 0 (score: 2)
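Not sure if this is what you are after, but you could use a StreamReader:
StreamReader sr = new StreamReader(stream, Encoding.UTF8); // Encoding.Unicode would mean UTF-16; a first byte of 195 (0xC3) for "ä" points to UTF-8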
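A minimal sketch of how a StreamReader could be layered over the memory-mapped view from the question. It assumes the file is UTF-8 encoded and not empty (CreateFromFile rejects zero-length files). The helper name ReadAllUtf8 is hypothetical.

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;
using System.Text;

public static class MappedTextReader
{
    // Hypothetical helper: reads the whole file through a memory-mapped view as UTF-8 text.
    public static string ReadAllUtf8(string absolutePath)
    {
        // Pass the exact file length so the view does not expose page-alignment padding.
        long length = new FileInfo(absolutePath).Length;

        using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(absolutePath, FileMode.Open))
        using (MemoryMappedViewStream view = mmf.CreateViewStream(0, length))
        using (StreamReader reader = new StreamReader(view, Encoding.UTF8))
        {
            // The reader assembles multi-byte sequences into chars and skips a leading BOM.
            return reader.ReadToEnd();
        }
    }
}
```

Letting the reader do the decoding keeps the memory-mapped access from the question while avoiding any manual handling of multi-byte sequences.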
Answer 1 (score: 0)
If you just want to load a UTF-8 file and read it into a string variable, the code can be as simple as:
var text = File.ReadAllText(filePath, Encoding.UTF8);
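However, if you insist on processing the UTF-8 data byte by byte, the parsing gets more involved.
Here is a rough (but workable) sketch that fits your original code: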
StringBuilder resultAsString = new StringBuilder();
using (MemoryMappedFile memoryMappedFile = MemoryMappedFile.CreateFromFile(filePath))
using (MemoryMappedViewStream viewStream = memoryMappedFile.CreateViewStream(0, new FileInfo(filePath).Length))
{
    int b;
    while ((b = viewStream.ReadByte()) != -1)
    {
        int acc = b;
        bool readUtfDataBytes(int bytesToRead)
        {
            while (bytesToRead-- > 0)
            {
                var nextB = viewStream.ReadByte();
                if (nextB == -1) return false; // EOS reached
                if ((nextB & 0xC0) != 0x80) return false; // invalid UTF-8 data byte
                acc <<= 6;
                acc |= nextB & 0x3F;
            }
            return true;
        }
        if (b >= 0xF0) // 1111 0000
        {
            acc &= 0x07;
            if (!readUtfDataBytes(3)) break; // break on malformed UTF-8
        }
        else if (b >= 0xE0) // 1110 0000
        {
            acc &= 0x0F;
            if (!readUtfDataBytes(2)) break; // break on malformed UTF-8
        }
        else if (b >= 0xC0) // 1100 0000
        {
            acc &= 0x1F;
            if (!readUtfDataBytes(1)) break; // break on malformed UTF-8
        }
        else if (b >= 0x80)
        {
            break; // break on malformed UTF-8
        }
        if (acc == 0xFEFF)
        {
            // ignore UTF-8 BOM
        }
        else
        {
            resultAsString.Append(Char.ConvertFromUtf32(acc));
        }
    }
}
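As an alternative to hand-written bit manipulation, the framework's System.Text.Decoder can do the incremental work. The following is a minimal sketch, not a drop-in replacement for the code above: the helper name Decode and the small default chunk size are assumptions chosen for illustration. Unlike the manual parser, the default decoder substitutes U+FFFD for malformed input instead of stopping, and it does not strip a leading BOM.

```csharp
using System;
using System.IO;
using System.Text;

public static class IncrementalUtf8
{
    // Hypothetical helper: decodes a stream chunk by chunk. The Decoder keeps
    // incomplete multi-byte sequences buffered between calls, so a character
    // split across two chunks is still decoded correctly.
    public static string Decode(Stream stream, int chunkSize = 4)
    {
        Decoder decoder = Encoding.UTF8.GetDecoder();
        byte[] bytes = new byte[chunkSize];
        // GetMaxCharCount covers the worst case, including bytes left over from a previous call.
        char[] chars = new char[Encoding.UTF8.GetMaxCharCount(chunkSize)];
        var result = new StringBuilder();

        int read;
        while ((read = stream.Read(bytes, 0, bytes.Length)) > 0)
        {
            int charCount = decoder.GetChars(bytes, 0, read, chars, 0);
            result.Append(chars, 0, charCount);
        }
        return result.ToString();
    }
}
```

Any Stream works as input, so the MemoryMappedViewStream from the code above could be passed in directly.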