Get the last 10 lines of a very large text file (> 10 GB)

Date: 2008-12-29 19:16:08

Tags: c# text large-files


21 answers:

Tail? Tail is a Unix command that displays the last few lines of a file. There is a Windows version in the Windows 2003 Server resource kit.

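As others have suggested, you can go to the end of the file and read backwards, which is efficient. However, it's slightly tricky - particularly because with a variable-length encoding (such as UTF-8) you need to be careful to make sure you get "whole" characters.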
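To illustrate the "whole characters" point, here is a minimal sketch that assumes the file is valid UTF-8. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so after seeking to an arbitrary offset you can step backwards until you reach a byte that does not match it. The class and method names (Utf8Boundary, AlignToCharStart) are made up for this example.

```csharp
using System.IO;

public static class Utf8Boundary
{
    // Returns the largest offset <= position at which a UTF-8 character starts.
    // Assumes the stream is seekable and contains valid UTF-8.
    public static long AlignToCharStart(Stream stream, long position)
    {
        while (position > 0 && position < stream.Length)
        {
            stream.Seek(position, SeekOrigin.Begin);
            int b = stream.ReadByte();
            if ((b & 0xC0) != 0x80)
                break;          // ASCII or lead byte: a character starts here
            position--;         // continuation byte: keep moving backwards
        }
        return position;
    }
}
```

Decoding from the returned offset onwards will not start in the middle of a multi-byte sequence. Other variable-length encodings need their own rule.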

You should be able to use FileStream.Seek() to move to the end of the file, then work your way backwards, looking for \n until you have enough lines.
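A rough sketch of that idea, reading backwards in chunks rather than byte by byte. It assumes '\n' line endings (an optional '\r' before them is handled by ReadLine) and an encoding such as ASCII or UTF-8 in which the byte 0x0A can only mean a newline. The names TailSketch, LastLines and ChunkSize are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class TailSketch
{
    // Returns up to lineCount trailing lines of the file at path.
    public static string[] LastLines(string path, int lineCount, Encoding encoding)
    {
        if (lineCount <= 0) return new string[0];
        const int ChunkSize = 4096;
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
        {
            long end = fs.Length;
            if (end == 0) return new string[0];

            // A newline that terminates the final line does not start a new line.
            fs.Seek(end - 1, SeekOrigin.Begin);
            long pos = fs.ReadByte() == '\n' ? end - 1 : end;

            long start = 0;   // byte offset where the tail begins
            int found = 0;    // newlines seen so far, counted from the end
            var chunk = new byte[ChunkSize];
            while (pos > 0 && found < lineCount)
            {
                int size = (int)Math.Min(ChunkSize, pos);
                pos -= size;
                fs.Seek(pos, SeekOrigin.Begin);
                int read = 0;
                while (read < size)
                {
                    int n = fs.Read(chunk, read, size - read);
                    if (n == 0) break;
                    read += n;
                }
                // Scan the chunk from its last byte towards its first.
                for (int i = read - 1; i >= 0; i--)
                {
                    if (chunk[i] == (byte)'\n' && ++found == lineCount)
                    {
                        start = pos + i + 1;   // the tail starts right after this newline
                        break;
                    }
                }
            }

            // Decode forwards from the start of the tail.
            fs.Seek(start, SeekOrigin.Begin);
            var lines = new List<string>();
            using (var reader = new StreamReader(fs, encoding))
            {
                string line;
                while ((line = reader.ReadLine()) != null) lines.Add(line);
            }
            return lines.ToArray();
        }
    }
}
```

Only the bytes near the end of the file are touched, so the cost depends on the length of the last lines, not on the file size. A call would look like TailSketch.LastLines(path, 10, Encoding.UTF8).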

I'm not sure how efficient it is, but in Windows PowerShell getting the last ten lines of a file is as easy as

Get-Content file.txt | Select-Object -last 10

That's what the unix tail command does. See http://en.wikipedia.org/wiki/Tail_(Unix)

There are lots of open-source implementations on the internet; here is one for Win32: Tail for Win32

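// Requires: using System; using System.IO;
using (StreamReader reader = new StreamReader(@"c:\test.txt")) // pick appropriate Encoding
{
    reader.BaseStream.Seek(0, SeekOrigin.End);
    int count = 0;
    while ((count < 10) && (reader.BaseStream.Position > 0))
    {
        reader.BaseStream.Position--;              // step back onto the previous byte
        int c = reader.BaseStream.ReadByte();      // reading moves forward by one again
        if (reader.BaseStream.Position > 0)
            reader.BaseStream.Position--;          // undo that move
        if (c == Convert.ToInt32('\n'))
            ++count;
    }
    string str = reader.ReadToEnd();               // starts at the 10th newline from the end
    string[] arr = str.Replace("\r", "").Split('\n');
}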

Here's my version. HTH

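using (StreamReader sr = new StreamReader(path))
{
  sr.BaseStream.Seek(0, SeekOrigin.End);

  int c;
  int count = 0;
  long pos = -1;

  // Walk backwards one byte at a time; stop at the start of a file with fewer than 10 lines
  while(count < 10 && -pos <= sr.BaseStream.Length)
  {
    sr.BaseStream.Seek(pos, SeekOrigin.End);
    sr.DiscardBufferedData();   // the reader buffers, so drop stale data after every seek
    c = sr.Read();

    if(c == Convert.ToInt32('\n'))
      ++count;
    --pos;
  }

  // pos has overshot by one: resume just after the 10th newline, or at the start of a short file
  long start = count == 10 ? pos + 2 : -sr.BaseStream.Length;
  sr.BaseStream.Seek(start, SeekOrigin.End);
  sr.DiscardBufferedData();
  string str = sr.ReadToEnd();
  string[] arr = str.Split('\n');
}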

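There are plenty of tail implementations on the net - look at the source code to see how they do it. Tail is very efficient (even on very large files), so they must have got it right when they wrote it!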

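What is the structure of your file? Are you sure the last 10 lines will be near the end of the file? If you have a file with 12 lines of text followed by 10 GB of 0s, then looking at the end won't be that fast, and you may have to scan through the whole file.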


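Open the file and start reading lines. After you've read 10 lines, open a second pointer starting at the front of the file, so the second pointer lags the first by 10 lines. Keep reading, moving both pointers in step, until the first one reaches the end of the file. Then use the second pointer to read the result. It works with a file of any size, including empty files and files shorter than the tail length, and it's easy to adjust for any tail length. The downside, of course, is that you end up reading the entire file, which may be exactly what you're trying to avoid.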
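The two-pointer idea above could be sketched roughly as follows, using two StreamReader instances over the same file. The names LaggingTail and LastLines are made up for the example, and the default UTF-8 decoding of StreamReader is assumed.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LaggingTail
{
    // Returns up to lineCount trailing lines by letting one reader lag behind another.
    public static List<string> LastLines(string path, int lineCount)
    {
        var result = new List<string>();
        if (lineCount <= 0) return result;

        using (var leader = new StreamReader(
            new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite)))
        using (var follower = new StreamReader(
            new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite)))
        {
            // Let the leader get lineCount lines ahead (or reach the end of a short file).
            int ahead = 0;
            while (ahead < lineCount && leader.ReadLine() != null) ahead++;

            // Move both readers in step until the leader runs out of lines.
            while (leader.ReadLine() != null) follower.ReadLine();

            // Whatever the follower has not read yet is the tail.
            string line;
            while ((line = follower.ReadLine()) != null) result.Add(line);
        }
        return result;
    }
}
```

This never seeks, so it does not care about the encoding or the line length, but as noted it reads the whole file (in fact most of it twice).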

// Note: this only works when every line has the same length (fixed-width records).
// GetReadFile and Tokenize are the poster's own members and are not shown here.
StreamReader leader = new StreamReader(GetReadFile);
leader.BaseStream.Position = 0;
StreamReader follower = new StreamReader(GetReadFile);

int count = 0;
string tmper = null;
while (count <= 12)
{
    tmper = leader.ReadLine();
    count++;
}

long total = follower.BaseStream.Length; // get total length of file
long step = tmper.Length; // get length of 1 line (excluding the line terminator)
long size = total / step; // divide to get the approximate number of lines
long go = step * (size - 12); // get the byte offset to jump to

long cut = follower.BaseStream.Seek(go, SeekOrigin.Begin); // Go to that location
follower.BaseStream.Position = go;

string led = null;
string[] lead = null ;
List<string[]> samples = new List<string[]>();

follower.ReadLine(); // skip the line we probably landed in the middle of

while (!follower.EndOfStream)
{
    led = follower.ReadLine();
    lead = Tokenize(led);
    samples.Add(lead);
}

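Using Sisutil's answer as a starting point, you can read the file line by line and load the lines into a Queue<String>. It does read the file from the start, but it has the virtue of not trying to read the file backwards, which can be really difficult if the file uses a variable-width character encoding like UTF-8, as Jon Skeet pointed out. It also makes no assumptions about line length.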


int numberOfLines = 10;
string fullFilePath = @"C:\Your\Large\File\BigFile.txt";
var queue = new Queue<string>(numberOfLines);

using (FileStream fs = File.Open(fullFilePath, FileMode.Open, FileAccess.Read, FileShare.Read)) 
using (BufferedStream bs = new BufferedStream(fs))  // May not make much difference.
using (StreamReader sr = new StreamReader(bs)) {
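    while (!sr.EndOfStream) {
        if (queue.Count == numberOfLines) {
            queue.Dequeue();   // drop the oldest line so only the newest numberOfLines remain
        }

        queue.Enqueue(sr.ReadLine());
    }
}

// The queue now has our set of lines. So print to console, save to another file, etc.
while (queue.Count > 0) {
    Console.WriteLine(queue.Dequeue());
}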

This worked really well and fast for me (I forgot to close the stream - fixed now):

    private string tail(StreamReader streamReader, long numberOfBytesFromEnd)
    {
        Stream stream = streamReader.BaseStream;
        long length = streamReader.BaseStream.Length;
        if (length < numberOfBytesFromEnd)
            numberOfBytesFromEnd = length;
        stream.Seek(numberOfBytesFromEnd * -1, SeekOrigin.End);

        int LF = '\n';
        int CR = '\r';   // unused here; see the note on carriage returns below
        bool found = false;

        // Skip the (probably partial) first line; also stop at end of stream (-1)
        // so a tail without any line feed cannot loop forever.
        while (!found) {
            int c = stream.ReadByte();
            if (c == LF || c == -1)
                found = true;
        }

        string readToEnd = streamReader.ReadToEnd();
        streamReader.Close();
        return readToEnd;
    }


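This doesn't really let you specify the number of lines from the end, which isn't a good idea anyway, because lines can be arbitrarily long and would kill performance again. So I specify a number of bytes instead, read until the first line feed, and then read to the end. In theory you could also look for the carriage return, but in my case that wasn't necessary.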


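        // Hypothetical call site: "filename" is a placeholder, and FileShare.ReadWrite is an
        // assumption chosen so that a process still writing to the file is not blocked.
        FileStream fileStream = new FileStream(
            filename, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);

        StreamReader streamReader = new StreamReader(fileStream);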

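// The class below exposes no EndOfStream member, so loop until ReadLine() returns null.
var reader = new ReverseTextReader(@"C:\Temp\ReverseTest.txt");
string line;
while ((line = reader.ReadLine()) != null)
    Console.WriteLine(line);
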

// Requires: using System; using System.Collections; using System.Collections.Generic;
//           using System.IO; using System.Text;

/// <summary>
/// Reads a text file backwards, line-by-line.
/// </summary>
/// <remarks>This class uses file seeking to read a text file of any size in reverse order.  This
/// is useful for needs such as reading a log file newest-entries first.  The parsing assumes
/// that every line, including the last one, ends in CR LF.</remarks>
public sealed class ReverseTextReader : IEnumerable<string>
{
    private const int BufferSize = 16384;   // The number of bytes read from the underlying stream.
    private readonly Stream _stream;        // Stores the stream feeding data into this reader
    private readonly Encoding _encoding;    // Stores the encoding used to process the file
    private byte[] _leftoverBuffer;         // Stores the leftover partial line after processing a buffer
    private readonly Queue<string> _lines;  // Stores the lines parsed from the buffer

    #region Constructors

    /// <summary>
    /// Creates a reader for the specified file.
    /// </summary>
    /// <param name="filePath"></param>
    public ReverseTextReader(string filePath)
        : this(new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read), Encoding.Default)
    { }

    /// <summary>
    /// Creates a reader using the specified stream.
    /// </summary>
    /// <param name="stream"></param>
    public ReverseTextReader(Stream stream)
        : this(stream, Encoding.Default)
    { }

    /// <summary>
    /// Creates a reader using the specified path and encoding.
    /// </summary>
    /// <param name="filePath"></param>
    /// <param name="encoding"></param>
    public ReverseTextReader(string filePath, Encoding encoding)
        : this(new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read), encoding)
    { }

    /// <summary>
    /// Creates a reader using the specified stream and encoding.
    /// </summary>
    /// <param name="stream"></param>
    /// <param name="encoding"></param>
    public ReverseTextReader(Stream stream, Encoding encoding)
    {
        _stream = stream;
        _encoding = encoding;
        _lines = new Queue<string>(128);
        // The stream needs to support seeking for this to work
        if (!_stream.CanSeek)
            throw new InvalidOperationException("The specified stream needs to support seeking to be read backwards.");
        if (!_stream.CanRead)
            throw new InvalidOperationException("The specified stream needs to support reading to be read backwards.");
        // Set the current position to the end of the file
        _stream.Position = _stream.Length;
        _leftoverBuffer = new byte[0];
    }

    #endregion

    #region Overrides

    /// <summary>
    /// Reads the previous line from the underlying stream.
    /// </summary>
    /// <returns>The line, or null once the beginning of the stream has been reached.</returns>
    public string ReadLine()
    {
        // Are there lines left to read? If so, return the next one
        if (_lines.Count != 0) return _lines.Dequeue();
        // Are we at the beginning of the stream? If so, we're done
        if (_stream.Position == 0) return null;

        #region Read and Process the Next Chunk

        // Remember the current position
        var currentPosition = _stream.Position;
        var newPosition = currentPosition - BufferSize;
        // Are we before the beginning of the stream?
        if (newPosition < 0) newPosition = 0;
        // Calculate the buffer size to read
        var count = (int)(currentPosition - newPosition);
        // Set the new position
        _stream.Position = newPosition;
        // Make a new buffer but append the previous leftovers
        var buffer = new byte[count + _leftoverBuffer.Length];
        // Read the next buffer
        _stream.Read(buffer, 0, count);
        // Move the position of the stream back
        _stream.Position = newPosition;
        // And copy in the leftovers from the last buffer
        if (_leftoverBuffer.Length != 0)
            Array.Copy(_leftoverBuffer, 0, buffer, count, _leftoverBuffer.Length);
        // Look for CrLf delimiters
        var end = buffer.Length - 1;
        var start = buffer.Length - 2;
        // Search backwards for a line feed
        while (start >= 0)
        {
            // Is it a line feed?
            if (buffer[start] == 10)
            {
                // Yes.  Extract a line and queue it (but exclude the \r\n)
                _lines.Enqueue(_encoding.GetString(buffer, start + 1, end - start - 2));
                // And reset the end
                end = start;
            }
            // Move to the previous character
            start -= 1;
        }
        // What's left over is a portion of a line. Save it for later.
        _leftoverBuffer = new byte[end + 1];
        Array.Copy(buffer, 0, _leftoverBuffer, 0, end + 1);
        // Are we at the beginning of the stream?
        if (_stream.Position == 0)
            // Yes.  Add the last line.
            _lines.Enqueue(_encoding.GetString(_leftoverBuffer, 0, end - 1));

        #endregion

        // If we have something in the queue, return it
        return _lines.Count == 0 ? null : _lines.Dequeue();
    }

    #endregion

    #region IEnumerator<string> Interface

    public IEnumerator<string> GetEnumerator()
    {
        string line;
        // So long as the next line isn't null...
        while ((line = ReadLine()) != null)
            // Read and return it.
            yield return line;
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        // Delegate to the generic enumerator instead of throwing
        return GetEnumerator();
    }

    #endregion
}

With PowerShell, use Get-Content big_file_name.txt -Tail 10, where 10 is the number of lines to retrieve from the bottom.

This has no performance issues. I ran it on a text file of over 100 GB and got an instant result.

// Requires: using System; using System.Collections.Generic; using System.IO; using System.Linq;
// FullName and ReaderEncoding are members of the enclosing class (not shown in the answer).
private string ReadRows(int offset)     /*offset: how many lines it reads from the end (10 in your case)*/
{
    /*no lines to read*/
    if (offset == 0)
        return string.Empty;

    /*a List<char> is used instead of a StringBuilder because characters are inserted at the front*/
    List<char> charBuilder = new List<char>();

    using (FileStream fs = new FileStream(FullName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 2048, true))
    {
        int count = 0;

        /*tested with utf8 file encoded by notepad-pp; other encoding may not work*/

        var decoder = ReaderEncoding.GetDecoder();
        byte[] buffer;
        int bufferLength;

        fs.Seek(0, SeekOrigin.End);

        while (true)
        {
            bufferLength = 1;
            buffer = new byte[1];

            /*for encoding with variable byte size, every time I read a byte that is part of the character and not an entire character the decoder returns '\uFFFD' (invalid character) */

            char[] chars = { '\uFFFD' }; // replacement character, code point 65533
            int iteration = 0;

            while (chars.Contains('\uFFFD'))
            {
                /*at every iteration that does not produce a character, the buffer gets bigger, up to 4 bytes*/
                if (iteration > 0)
                {
                    bufferLength = buffer.Length + 1;

                    byte[] newBuffer = new byte[bufferLength];

                    Array.Copy(buffer, newBuffer, bufferLength - 1);

                    buffer = newBuffer;
                }

                /*there are no characters with more than 4 bytes in utf-8*/
                if (iteration > 4)
                    throw new Exception();

                /*if all is ok, the last seek fails with an IOException (start of file) and chars is set to '\0'*/
                try
                {
                    fs.Seek(-(bufferLength), SeekOrigin.Current);
                }
                catch (IOException)
                {
                    chars = new char[] { '\0' };
                    break;
                }

                fs.Read(buffer, 0, bufferLength);

                var charCount = decoder.GetCharCount(buffer, 0, bufferLength);
                chars = new char[charCount];

                decoder.GetChars(buffer, 0, bufferLength, chars, 0);

                ++iteration;
            }

            /*when I get a char*/
            charBuilder.InsertRange(0, chars);

            if (chars.Length > 0 && chars[0] == '\n')
                ++count;

            /*exit when I get the correct number of lines (the last row is in the interval)*/
            if (count == offset + 1)
                break;

            /*the first seek goes back, the read goes forward, then we go back again, except the last time*/
            try
            {
                fs.Seek(-(bufferLength), SeekOrigin.Current);
            }
            catch (Exception)
            {
                break;
            }
        }
    }

    /*the characters are already in file order; only the '\0' marker added at the start of the file is dropped*/

    return new string(charBuilder.ToArray()).TrimStart('\0');
}


Answer 20: (score: -11)



