Question

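I have to read a text file that contains characters from the following languages: English, Japanese, Chinese, French, Spanish, German, and Italian.

My task is simply to read the data and write it to a new text file, inserting a line break (\n) after every 100 characters.

I can't use File.ReadAllText or File.ReadAllLines because the file may be larger than 500 MB. So I wrote the following code: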

using (var streamReader = new StreamReader(inputFilePath, Encoding.ASCII))
{
      using (var streamWriter = new StreamWriter(outputFilePath,false))
      {
           char[] bytes = new char[100];
           while (streamReader.Read(bytes, 0, 100) > 0)
           {
                 var data = new string(bytes);
                 streamWriter.WriteLine(data);
           }
           MessageBox.Show("Compleated");
       }
}

Besides ASCII, I also tried UTF-7, UTF-8, UTF-32, and IBM500, but had no luck reading and writing the multilingual characters.

Please help me get this working.

Answer 1

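You will need to look at the first 4 bytes of the file you are parsing. If the file starts with a byte order mark (BOM), these bytes give you a hint about which encoding to use.

Here is a helper method I wrote for the task: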

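// Requires: using System.Text;
// As an extension method, this must be declared inside a static class.
public static string GetStringFromEncodedBytes(this byte[] bytes) {
    // No BOM: fall back to Encoding.Default
    // (the ANSI code page on .NET Framework, UTF-8 on .NET Core and later).
    var encoding = Encoding.Default;
    var skipBytes = 0;

    // The UTF-32 BOMs are tested first because the UTF-32 LE BOM (FF FE 00 00)
    // begins with the UTF-16 LE BOM (FF FE).
    if (bytes.Length >= 4 && bytes[0] == 0xff && bytes[1] == 0xfe && bytes[2] == 0 && bytes[3] == 0) {
        encoding = Encoding.UTF32;                 // UTF-32, little-endian
        skipBytes = 4;
    }
    else if (bytes.Length >= 4 && bytes[0] == 0 && bytes[1] == 0 && bytes[2] == 0xfe && bytes[3] == 0xff) {
        encoding = new UTF32Encoding(true, true);  // UTF-32, big-endian
        skipBytes = 4;
    }
    else if (bytes.Length >= 3 && bytes[0] == 0xef && bytes[1] == 0xbb && bytes[2] == 0xbf) {
        encoding = Encoding.UTF8;
        skipBytes = 3;
    }
    else if (bytes.Length >= 3 && bytes[0] == 0x2b && bytes[1] == 0x2f && bytes[2] == 0x76) {
        encoding = Encoding.UTF7;                  // marked obsolete in .NET 5 and later
        skipBytes = 3;
    }
    else if (bytes.Length >= 2 && bytes[0] == 0xff && bytes[1] == 0xfe) {
        encoding = Encoding.Unicode;               // UTF-16, little-endian
        skipBytes = 2;
    }
    else if (bytes.Length >= 2 && bytes[0] == 0xfe && bytes[1] == 0xff) {
        encoding = Encoding.BigEndianUnicode;      // UTF-16, big-endian
        skipBytes = 2;
    }

    // Decode everything after the BOM without copying the array.
    return encoding.GetString(bytes, skipBytes, bytes.Length - skipBytes);
}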
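
The helper above works on a byte array that already holds the whole file, which the question rules out for inputs of 500 MB or more. As a rough sketch of the same idea applied to a stream, the snippet below leans on the BOM detection built into StreamReader instead of checking the bytes by hand. The class name BomReader and the method Open are hypothetical names chosen for illustration, and the sketch assumes the file either starts with a BOM or is encoded in whatever fallback you pass in.

```csharp
using System.IO;
using System.Text;

public static class BomReader
{
    // Hypothetical helper: returns a reader that switches to the encoding
    // announced by a byte order mark at the start of the stream and uses
    // `fallback` when no BOM is found. Nothing beyond the reader's internal
    // buffer is loaded into memory.
    public static StreamReader Open(Stream stream, Encoding fallback)
    {
        return new StreamReader(stream, fallback, detectEncodingFromByteOrderMarks: true);
    }
}
```

For a file on disk this could be called as BomReader.Open(File.OpenRead(inputFilePath), Encoding.UTF8). A file without a BOM in a legacy encoding (for example Shift-JIS) would still need the correct fallback supplied by the caller, since no BOM check can identify it.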

Answer 2

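Here is a good starting point for an answer. If i is not equal to 100, you need to read more characters. French characters such as é are not a problem; the C# char type handles them all.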

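char[] soFlow = new char[100];
using (StreamReader sr = new StreamReader("a.txt"))   // decodes as UTF-8 unless a BOM says otherwise
using (StreamWriter sw = new StreamWriter("b.txt", false))
{
    while (!sr.EndOfStream)
    {
        try
        {
            int filled = 0, i = 0;
            // Read may return fewer chars than requested (i < 100),
            // so keep reading until the buffer is full or the input ends.
            while (filled < soFlow.Length
                   && (i = sr.Read(soFlow, filled, soFlow.Length - filled)) > 0)
            {
                filled += i;
            }
            // Write only the chars that were actually read, then a line break.
            sw.WriteLine(new string(soFlow, 0, filled));
        }
        catch (IOException e)
        {
            Console.WriteLine(e.Message);
            break; // stop rather than retry a failing read forever
        }
    }
}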

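Spec: Read(Char[], Int32, Int32) reads a specified maximum number of characters from the current stream into a buffer, beginning at the specified index.

Anyway, it definitely works for me :).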
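
Since Read, per the spec quoted above, only promises a maximum number of characters, the count it returns has to be used when writing. A minimal sketch of the whole copy loop might look like the following. It swaps in TextReader.ReadBlock, which keeps reading until the buffer is full or the input ends, and the names ChunkedCopy and WrapEvery are hypothetical, not part of any library.

```csharp
using System.IO;

public static class ChunkedCopy
{
    // Copies text from reader to writer and starts a new line after every
    // `width` chars (UTF-16 code units). Only chars actually read are written.
    public static void WrapEvery(TextReader reader, TextWriter writer, int width = 100)
    {
        var buffer = new char[width];
        int count;
        // ReadBlock returns fewer than `width` chars only when the input ends,
        // so a short line can appear only as the last line of the output.
        while ((count = reader.ReadBlock(buffer, 0, width)) > 0)
        {
            writer.Write(buffer, 0, count);
            writer.Write('\n');
        }
    }
}
```

For files, this would be called with a StreamReader opened on the input (UTF-8 with BOM detection by default) and a StreamWriter opened on the output, for example new StreamWriter(outputFilePath, false, Encoding.UTF8). One caveat: a char is a UTF-16 code unit, so a character outside the Basic Multilingual Plane (some rare CJK ideographs, emoji) occupies two chars and could be split by a break that falls exactly between them. The sketch does not handle that case.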

Reading a multilingual text file in C#
