Question

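I'm developing a log parser, and I'm reading files of strings larger than 150MB. This is my approach; is there any way to optimize what happens inside the while statement? The problem is that it consumes a lot of memory. I also tried a StringBuilder and ended up with the same memory consumption.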

private void ReadLogInThread()
        {
            string lineOfLog = string.Empty;

            try
            {
                StreamReader logFile = new StreamReader(myLog.logFileLocation);
                InformationUnit infoUnit = new InformationUnit();

                infoUnit.LogCompleteSize = myLog.logFileSize;

                while ((lineOfLog = logFile.ReadLine()) != null)
                {
                    myLog.transformedLog.Add(lineOfLog); //list<string>
                    myLog.logNumberLines++;

                    infoUnit.CurrentNumberOfLine = myLog.logNumberLines;
                    infoUnit.CurrentLine = lineOfLog;
                    infoUnit.CurrentSizeRead += lineOfLog.Length;


                    if (onLineRead != null)
                        onLineRead(infoUnit);
                }
            }
            catch { throw; }
        }

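Thanks in advance!

EXTRA: I'm saving each line because, after reading the log, I need to check some information on every stored line. The language is C#.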

Answer 1

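If your log lines can actually be parsed into a data row representation, you can save memory.

Here is a typical log line I can think of:

Event at: 2019/01/05:0:24:32.435, Reason: Operation, Kind: DataStoreOperation, Operation Status: Success

This line takes roughly 200 bytes in memory (about 100 characters at 2 bytes each). By contrast, the following representation takes just 16 bytes: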

enum LogReason { Operation, Error, Warning }              // int-backed: 4 bytes
enum EventKind : short { DataStoreOperation, DataReadOperation }  // 2 bytes
enum OperationStatus : short { Success, Failed }          // 2 bytes

struct LogRow
{
  public DateTime EventTime;      // 8 bytes
  public LogReason Reason;
  public EventKind Kind;
  public OperationStatus Status;
}
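
A minimal sketch of how a line might be turned into such a row. It assumes every line follows exactly the sample layout above ("Label: value" fields separated by ", ") and that the timestamp format is yyyy/MM/dd:H:mm:ss.fff; the LogRowParser class and its ValueOf helper are hypothetical names, not part of the question's code.

```csharp
using System;
using System.Globalization;

enum LogReason { Operation, Error, Warning }
enum EventKind : short { DataStoreOperation, DataReadOperation }
enum OperationStatus : short { Success, Failed }

struct LogRow
{
    public DateTime EventTime;
    public LogReason Reason;
    public EventKind Kind;
    public OperationStatus Status;
}

static class LogRowParser
{
    // Assumed layout:
    // "Event at: <time>, Reason: <r>, Kind: <k>, Operation Status: <s>"
    public static LogRow Parse(string line)
    {
        string[] fields = line.Split(new[] { ", " }, StringSplitOptions.None);
        if (fields.Length != 4)
            throw new FormatException("Unexpected log line: " + line);

        var row = new LogRow();
        row.EventTime = DateTime.ParseExact(
            ValueOf(fields[0]), "yyyy/MM/dd:H:mm:ss.fff", CultureInfo.InvariantCulture);
        row.Reason = (LogReason)Enum.Parse(typeof(LogReason), ValueOf(fields[1]));
        row.Kind = (EventKind)Enum.Parse(typeof(EventKind), ValueOf(fields[2]));
        row.Status = (OperationStatus)Enum.Parse(typeof(OperationStatus), ValueOf(fields[3]));
        return row;
    }

    // Returns the text after the first ": " separator of a "Label: value" field.
    static string ValueOf(string field)
    {
        int i = field.IndexOf(": ", StringComparison.Ordinal);
        if (i < 0)
            throw new FormatException("Missing separator in field: " + field);
        return field.Substring(i + 2);
    }
}
```

Storing a List<LogRow> instead of a List<string> keeps only the compact rows alive; each original line string becomes garbage as soon as it has been parsed.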

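Another optimization is to split each line into an array of string tokens, so that you can take advantage of string interning. For example, if the word "DataStoreOperation" takes 36 bytes (18 characters at 2 bytes each) and the file has 1,000,000 entries, replacing each copy with a 4-byte reference saves (18 * 2 - 4) * 1,000,000 = 32,000,000 bytes.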

Answer 2

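Try to make your algorithm sequential.

If you don't need random access to lines by index, using an IEnumerable instead of a List helps with memory, while keeping the same semantics as working with a list.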

IEnumerable<string> ReadLines()
{
  // ...
  while ((lineOfLog = logFile.ReadLine()) != null)
  {
    yield return lineOfLog;
  }
}
//...
foreach( var line in ReadLines() )
{
  ProcessLine(line);
}
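
A self-contained version of the same idea, as a sketch only. The LogScanner class and the CountMatching check are hypothetical stand-ins for whatever per-line processing the parser actually needs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class LogScanner
{
    // Lazily yields lines; only the current line is kept alive by this method.
    public static IEnumerable<string> ReadLines(string path)
    {
        using (var reader = new StreamReader(path))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
                yield return line;
        }
    }

    // Hypothetical per-line check: counts the lines that contain a marker.
    public static int CountMatching(IEnumerable<string> lines, string marker)
    {
        int count = 0;
        foreach (string line in lines)
        {
            if (line.Contains(marker))
                count++;
        }
        return count;
    }
}
```

Because the reader sits inside a using block in the iterator, it is disposed when the foreach finishes or is abandoned early. The framework's File.ReadLines(path) provides the same lazy enumeration without a hand-written iterator.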

Answer 3

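I'm not sure whether it fits your project, but you could store the result in a StringBuilder instead of a list of strings.

For example, on my machine this process takes 250MB of memory after loading (the file is 50MB):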

static void Main(string[] args)
{
    using (StreamReader streamReader = File.OpenText("file.txt"))
    {
        var list = new List<string>();
        string line;
        while (( line=streamReader.ReadLine())!=null)
        {
            list.Add(line);
        }
    }
}

On the other hand, this code takes only 100MB:

static void Main(string[] args)
{
    var stringBuilder = new StringBuilder();
    using (StreamReader streamReader = File.OpenText("file.txt"))
    {
        string line;
        while (( line=streamReader.ReadLine())!=null)
        {
            stringBuilder.AppendLine(line);
        }
    }
}

Answer 4

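Memory usage keeps going up because you are simply adding the lines to a List<string> that keeps growing. If you want to use less memory, one thing you can do is write the data to disk rather than keeping it in scope. Of course, this will slow things down considerably.

Another option is to compress the string data as you store it in the list and decompress it when you read it back, but I don't think that is a good approach.

Side note:

You need to add a using block around your StreamReader.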

using (StreamReader logFile = new StreamReader(myLog.logFileLocation))

Answer 5

Consider this implementation (I'm speaking in C/C++ terms; substitute the C# equivalents as needed):

Use fseek/ftell to find the size of the file.

Use malloc to allocate a chunk of memory the size of the file + 1;
Set that last byte to '\0' to terminate the string.

Use fread to read the entire file into the memory buffer.
You now have char * which holds the contents of the file as a 
string.

Create a vector of const char * to hold pointers to the positions 
in memory where each line can be found.   Initialize the first element 
of the vector to the first byte of the memory buffer.

Find the carriage control characters (probably \r\n)   Replace the 
\r by \0 to make the line a string.   Increment past the \n.  
This new pointer location is pushed back onto the vector.

Repeat the above until all of the lines in the file have been NUL 
terminated, and are pointed to by elements in the vector.

Iterate through the vector as needed to investigate the contents of 
each line, in your business specific way.

When you are done, close the file, free the memory,  and continue 
happily along your way.
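
A rough C# analogue of the steps above, as a sketch under assumptions rather than a direct port: instead of a char buffer with pointers, it keeps the whole file as one string plus a list of line-start offsets, and materializes a line only when asked. The LineIndex class is a hypothetical name.

```csharp
using System;
using System.Collections.Generic;

sealed class LineIndex
{
    private readonly string text;                        // whole file contents
    private readonly List<int> starts = new List<int>(); // offset of each line start

    public LineIndex(string text)
    {
        this.text = text;
        if (text.Length > 0)
            starts.Add(0);
        for (int i = 0; i < text.Length; i++)
        {
            // A new line starts right after every '\n' that is not the last char.
            if (text[i] == '\n' && i + 1 < text.Length)
                starts.Add(i + 1);
        }
    }

    public int Count { get { return starts.Count; } }

    // Builds the requested line on demand, without its "\r\n" or "\n" terminator.
    public string GetLine(int index)
    {
        int start = starts[index];
        int end = index + 1 < starts.Count ? starts[index + 1] : text.Length;
        while (end > start && (text[end - 1] == '\n' || text[end - 1] == '\r'))
            end--;
        return text.Substring(start, end - start);
    }
}
```

The text could come from File.ReadAllText. This trades millions of small string objects for one very large one plus 4 bytes per line; note that a .NET string still stores 2 bytes per character, so a 150MB ASCII file becomes roughly 300MB in memory.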

Answer 6

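1) Compress the strings before you store them (see System.IO.Compression and GZipStream). This would probably kill the performance of your program, since you would have to decompress in order to read each line.

2) Remove any extra whitespace characters or common words you can do without. That is, if you can still understand what the log says without the words "the, a, of...", then remove them. Also, shorten any common words (for example, change "error" to "err" and "warning" to "wrn"). This would slow down this step of the process but shouldn't affect the performance of the rest.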
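
A minimal sketch of option 1 using GZipStream. The StringGzip class is a hypothetical helper, and encoding the text as UTF-8 before compressing is an assumption, not something the answer specifies.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class StringGzip
{
    // Compresses text into a gzip byte array.
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            // Disposing the GZipStream flushes the final compressed block.
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray(); // ToArray still works on a closed MemoryStream
        }
    }

    // Restores the original text from a gzip byte array.
    public static string Decompress(byte[] data)
    {
        using (var input = new MemoryStream(data))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Gzip adds a fixed header and trailer to every stream, so compressing each short line separately can make it larger. Compressing blocks of many lines at a time is more likely to pay off, at the cost of decompressing a whole block to read one line.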

Answer 7

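What encoding is your original file in? If it is ASCII, the strings alone will take more than twice the file size once loaded into your list. A C# char is 2 bytes, and a C# string adds about 20 bytes of overhead per string on top of its characters.

In your case, since it is a log file, you can probably exploit the fact that the messages contain a lot of repetition. You can most likely parse each incoming line into a data structure that reduces the memory overhead. For example, if the log file contains a timestamp, you can convert it to a DateTime value, which is 8 bytes. Even a short timestamp such as 1/1/10 adds 12 bytes to the size of a string, and a timestamp with time information would be longer still. Other tokens in your log stream may be convertible to codes or enums in a similar way.

Even if you have to keep the values as strings, you can reduce memory usage by breaking each line into commonly used pieces or by stripping boilerplate you don't need at all. If there are many common strings, you can intern them and pay for only one copy no matter how many occurrences you have.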
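
A small sketch of the interning idea, assuming fields are separated by ", " as in a typical comma-delimited log line. The TokenInterner class is a hypothetical helper; string.Intern is the framework call doing the work.

```csharp
using System;

static class TokenInterner
{
    // Splits a line into tokens and replaces each token with its interned
    // instance, so equal tokens from different lines share one string object.
    public static string[] SplitAndIntern(string line)
    {
        string[] tokens = line.Split(new[] { ", " }, StringSplitOptions.None);
        for (int i = 0; i < tokens.Length; i++)
        {
            tokens[i] = string.Intern(tokens[i]);
        }
        return tokens;
    }
}
```

Interned strings stay alive for the lifetime of the process, so this suits a limited vocabulary of repeated tokens better than unique values such as timestamps. A private Dictionary<string, string> used as a pool gives the same sharing and can be dropped once parsing is finished.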

Answer 8

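If you must store the raw data, and assuming your logs are mostly ASCII, you can save some memory by storing UTF8 bytes internally. Strings are UTF16 internally, so you are storing an extra byte for each character. By switching to UTF8 you cut the memory used by the characters in half (not counting per-object overhead, which is still significant). You can then convert back to normal strings as needed.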

static void Main(string[] args)
{
    List<Byte[]> strings = new List<byte[]>();

    using (TextReader tr = new StreamReader(@"C:\test.log"))
    {
        string s = tr.ReadLine();
        while (s != null)
        {
            // Encode the line directly as UTF-8 (1 byte per ASCII character)
            strings.Add(Encoding.UTF8.GetBytes(s));
            s = tr.ReadLine();
        }
    }

    // Get strings back
    foreach( var str in strings)
    {
        Console.WriteLine(Encoding.UTF8.GetString(str));
    }
}

How can I optimize memory usage in this algorithm?
