Question

How can StreamReader read all characters, including the 0x0D 0x0A characters?

I have an old .txt file that I am trying to convert. Many lines (but not all) end with "0x0D 0x0D 0x0A".

This code reads all the lines.

StreamReader srFile = new StreamReader(gstPathFileName);
while (!srFile.EndOfStream) {
    string stFileContents = srFile.ReadLine();
    ...
}

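This produces an extra "" string between every line of the .txt file. Because there are some blank lines between paragraphs, removing all the "" strings would also remove those blank lines.

Is there a way to make StreamReader read all characters, including "0x0D 0x0D 0x0A"?

Edit two hours later... the file is large, 1.6MB.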
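
The extra "" strings described above can be reproduced on a small in-memory sample. The sketch below is only an illustration: the class name ReadLineDemo and the method ReadAll are hypothetical, and it uses StringReader instead of a file. ReadLine treats a lone 0x0D, a lone 0x0A, and the pair 0x0D 0x0A as line terminators, so "0x0D 0x0D 0x0A" is seen as two line endings.

```csharp
using System.Collections.Generic;
using System.IO;

public static class ReadLineDemo
{
    // Collects every string that ReadLine returns for the given text.
    public static List<string> ReadAll(string text)
    {
        var lines = new List<string>();
        using (var reader = new StringReader(text))
        {
            string line;
            // ReadLine returns null once the end of the text is reached
            while ((line = reader.ReadLine()) != null)
            {
                lines.Add(line);
            }
        }
        return lines;
    }
}
```

Calling ReadLineDemo.ReadAll("a\r\r\nb\r\n") returns "a", "", "b": the first '\r' ends the line "a", and the following "\r\n" ends an empty line.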

Answer 1

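A very simple reimplementation of ReadLine. I made a version that returns IEnumerable<string> because it is easier. I put it in an extension method, hence the static class. The code is heavily commented, so it should be easy to read.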

using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class StreamEx
{
    public static string[] ReadAllLines(this TextReader tr, string separator)
    {
        return tr.ReadLines(separator).ToArray();
    }

    // StreamReader is based on TextReader
    public static IEnumerable<string> ReadLines(this TextReader tr, string separator)
    {
        // Handling of empty file: old remains null
        string old = null;

        // Read buffer
        var buffer = new char[128];

        while (true)
        {
            // If we already read something
            if (old != null)
            {
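                // Look for the separator (ordinal comparison, so "\r\n" is
                // matched character by character regardless of culture)
                int ix = old.IndexOf(separator, System.StringComparison.Ordinal);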

                // If found
                if (ix != -1)
                {
                    // Return the piece of line before the separator
                    yield return old.Remove(ix);

                    // Then remove the piece of line before the separator plus the separator
                    old = old.Substring(ix + separator.Length);

                    // And continue 
                    continue;
                }
            }

            // old doesn't contain any separator, let's read some more chars
            int read = tr.ReadBlock(buffer, 0, buffer.Length);

            // If there is no more chars to read, break the cycle
            if (read == 0)
            {
                break;
            }

            // Add the just read chars to the old chars
            // note that null + "somestring" == "somestring"
            old += new string(buffer, 0, read);

            // A new "round" of the while cycle will search for the separator
        }

        // Now we have to handle chars after the last separator

        // If we read something
        if (old != null)
        {
            // Return all the remaining characters
            yield return old;
        }
    }
}

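Note that, as written, it does not directly solve your problem :-) but it lets you choose the separator to use. So you use "\r\n" and then trim the excess '\r'.

Use it like this: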

using (var sr = new StreamReader("somefile"))
{
    // Little LINQ to strip excess \r and to make an array
    // (note that by making an array you'll put all the file
    // in memory)
    string[] lines = sr.ReadLines("\r\n").Select(x => x.TrimEnd('\r')).ToArray();
}

or

using (var sr = new StreamReader("somefile"))
{
    // Little LINQ to strip excess \r
    // (note that the file will be read line by line, so only
    // a line at a time is in memory, plus some remaining characters
    // of the next line in the old buffer)
    IEnumerable<string> lines = sr.ReadLines("\r\n").Select(x => x.TrimEnd('\r'));

    foreach (string line in lines)
    {
        // Do something
    }
}

Answer 2

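You can always use a BinaryReader and manually read one line of bytes at a time. Keep the bytes, and when you encounter 0x0d 0x0d 0x0a, create a new string from the bytes of the current line.

Notes:

I am assuming your encoding is Encoding.UTF8, but your case may differ. When accessing the bytes directly, I don't know offhand how to work out the encoding.
If your file has extra information, such as a byte order mark, that will be returned as well.

Here it is: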

public static IEnumerable<string> ReadLinesFromStream(string fileName)
{
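    // File.OpenRead opens the file named by the fileName parameter for reading
    using ( var fileStream = File.OpenRead(fileName) )
    using ( BinaryReader binaryReader = new BinaryReader(fileStream) )
    {
        var bytes = new List<byte>();
        // Compare Position with Length instead of calling PeekChar, which decodes
        // characters and can throw on byte sequences that are not valid text
        while ( fileStream.Position < fileStream.Length )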
        {
            bytes.Add(binaryReader.ReadByte());

            bool newLine = bytes.Count > 2
                && bytes[bytes.Count - 3] == 0x0d
                && bytes[bytes.Count - 2] == 0x0d
                && bytes[bytes.Count - 1] == 0x0a;

            if ( newLine )
            {
                yield return Encoding.UTF8.GetString(bytes.Take(bytes.Count - 3).ToArray());
                bytes.Clear();
            }
        }

        if ( bytes.Count > 0 )
            yield return Encoding.UTF8.GetString(bytes.ToArray());
    }
}
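
Regarding the byte order mark mentioned in the notes: Encoding.UTF8.GetString does not remove it, so the first returned line would start with U+FEFF. A minimal sketch of one way to skip it, assuming the file really is UTF-8; the class name BomHelper and the method DecodeUtf8WithoutBom are hypothetical and not part of the answer's code.

```csharp
using System.Text;

public static class BomHelper
{
    // Decodes UTF-8 bytes, skipping a leading EF BB BF byte order mark if present.
    public static string DecodeUtf8WithoutBom(byte[] bytes)
    {
        int offset = 0;
        if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
        {
            // The first three bytes are the UTF-8 BOM, so start decoding after them
            offset = 3;
        }
        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }
}
```

Only the first line of a file can carry the mark, so a helper like this would be applied to the first byte array that ReadLinesFromStream decodes.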

Answer 3

This code works well... it reads every character.

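int iReadLength = 100;
char[] acBuf = new char[iReadLength];
while (srFile.Peek() >= 0) {
    // Read returns the number of chars actually read; the final chunk may be shorter
    int iRead = srFile.Read(acBuf, 0, iReadLength);
    // Build the string from only the chars that were read, not the whole buffer
    string s = new string(acBuf, 0, iRead);
}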

Answer 4

A very simple solution (not optimized for memory consumption) could be:

var allLines = File.ReadAllText(gstPathFileName)
    .Split('\n');

If you need to remove the trailing carriage-return characters, do this:

for(var i = 0; i < allLines.Length; ++i)
    allLines[i] = allLines[i].TrimEnd('\r');

If you want, you can put the relevant processing inside the for loop. Or, if you don't want to keep the array, use this instead of the for:

foreach(var line in allLines.Select(x => x.TrimEnd('\r')))
{
    // use 'line' here ...
}
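
To see how this approach treats the "0x0D 0x0D 0x0A" endings from the question, the split and the trim can be tried on an in-memory string. This is only a sketch; the class name SplitDemo and the method SplitLines are hypothetical.

```csharp
using System.Linq;

public static class SplitDemo
{
    // Splits on '\n' and strips any run of trailing '\r' from each piece.
    public static string[] SplitLines(string text)
    {
        return text.Split('\n')
            .Select(x => x.TrimEnd('\r'))
            .ToArray();
    }
}
```

SplitDemo.SplitLines("a\r\r\n\r\r\nb\r\n") returns "a", "", "b", "". The blank line between "a" and "b" is kept as a single "" entry, and the last "" appears because the text ends with a line terminator.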
