Reading a huge text file starting from the end

Time: 2012-11-29 07:59:51

Tags: vb.net

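I would like to ask whether you can suggest some alternatives for my problem.

Basically, I am reading .txt log files that average 8 million lines, which is roughly 600 MB of plain raw text.

I currently use a StreamReader to make 2 passes over those 8 million lines, sorting and filtering the important parts of the log file. One complete run takes about 50 seconds on my computer.

One way I could optimize this is to make the first pass start reading from the end, because the most important data is located in roughly the last 200k lines. Unfortunately, I searched and StreamReader cannot do this. Any ideas?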

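Some general constraints

  • The number of lines varies
  • The file size varies
  • The location of the important data varies, but it is roughly within the last 200k lines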

Here is the loop code for the first pass over the log file, just to give you an idea

    Do Until sr.EndOfStream = True  'Read whole File
        Dim streambuff As String = sr.ReadLine
        Dim CombatLogNames() As String  'Array to Store CombatLogNames
        Dim searcher As String

        If streambuff.Contains("CombatLogNames flags:0x1") Then  'Keyword to Filter CombatLogNames Packets in the .txt

            Dim check As String = streambuff  'Duplicate of the Line being read
            Dim index1 As Char = check.Substring(check.IndexOf("(") + 1)
            Dim index2 As Char = check.Substring(check.IndexOf("(") + 2)  'Used to bypass the first CombatLogNames packet that contain only 1 entry

            If (check.IndexOf("(") <> -1 And index1 <> "" And index2 <> " ") Then  'Stricter Filters for CombatLogNames

                Dim endCLN As Integer = 0  'Signifies the end of CombatLogNames Packet
                Dim x As Integer = 0  'Counter for array

                While (endCLN = 0 And streambuff <> "---- CNETMsg_Tick")  'Loops until the end keyword for CombatLogNames is seen

                    streambuff = sr.ReadLine  'Reads a new line to flush out "CombatLogNames flags:0x1" which is unneeded
                    If ((streambuff.Contains("---- CNETMsg_Tick") = True) Or (streambuff.Contains("ResponseKeys flags:0x0 ") = True)) Then

                        endCLN = 1  'Value change to determine end of CombatLogName packet

                    Else

                        ReDim Preserve CombatLogNames(x)  'Resizes the array while preserving the values
                        searcher = streambuff.Trim.Remove(streambuff.IndexOf("(") - 5).Remove(0, _
                        streambuff.Trim.Remove(streambuff.IndexOf("(")).IndexOf("'"))  'Additional filtering to get only valuable data
                        CombatLogNames(x) = search(searcher)
                        x += 1  '+1 to Array counter

                    End If
                End While
            Else
                'MsgBox("Something went wrong, Flame the coder of this program!!")  'Bug Testing code that is disabled
            End If
        Else
        End If

        If (sr.EndOfStream = True) Then

            ReDim GlobalArr(CombatLogNames.Length - 1)  'Resizing the Global array to prime it for copying data
            Array.Copy(CombatLogNames, GlobalArr, CombatLogNames.Length)  'Just copying the array to make it global

        End If
    Loop

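2 Answers:

Answer 0 (score: 1)

You can set the BaseStream to the desired read position. You just cannot set it to a specific LINE, because counting lines requires reading the complete file.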

    Using sw As New StreamWriter("foo.txt", False, System.Text.Encoding.ASCII)
        For i = 1 To 100
            sw.WriteLine("the quick brown fox jumps ovr the lazy dog")
        Next

    End Using
    Using sr As New StreamReader("foo.txt", System.Text.Encoding.ASCII)
        sr.BaseStream.Seek(-100, SeekOrigin.End)
        Dim garbage = sr.ReadLine ' can not use, because very likely not a COMPLETE line
        While Not sr.EndOfStream
            Dim line = sr.ReadLine
            Console.WriteLine(line)
        End While
    End Using
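
One caveat with the snippet above: seeking back further than the file is long fails, because the position would land before the start of the file. The following is a minimal sketch of a guarded variant, not the answerer's code. The module name `TailReader`, the function `ReadTailLines`, and the parameter `tailBytes` are hypothetical, and it assumes a single-byte encoding such as ASCII so that a byte offset cannot split a character.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Module TailReader
    ' Hypothetical helper: returns the complete lines found in roughly the
    ' last tailBytes bytes of the file. Assumes a single-byte encoding.
    Function ReadTailLines(path As String, tailBytes As Long) As List(Of String)
        Dim lines As New List(Of String)
        Using fs As New FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite)
            ' Clamp the start offset so small files are read from byte 0.
            Dim start As Long = Math.Max(0L, fs.Length - tailBytes)
            fs.Seek(start, SeekOrigin.Begin)
            Using sr As New StreamReader(fs, Encoding.ASCII)
                ' Unless we start at byte 0, the first line is probably partial,
                ' so drop it. (This also drops a complete line if the offset
                ' happens to land exactly on a line boundary.)
                If start > 0 Then sr.ReadLine()
                Dim line As String = sr.ReadLine()
                While line IsNot Nothing
                    lines.Add(line)
                    line = sr.ReadLine()
                End While
            End Using
        End Using
        Return lines
    End Function
End Module
```

As a rough sizing estimate from the numbers in the question: 600 MB over 8 million lines is about 75 bytes per line, so the last 200k lines are on the order of 15 MB. Since line lengths vary, a call such as `ReadTailLines(path, 20 * 1024 * 1024)` would leave some margin.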

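For any subsequent read of the same file, you can simply save the final position (of the base stream) and, on the next read, advance to that position before you start reading lines.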
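
A minimal sketch of that idea follows, under stated assumptions. The module name `IncrementalReader` and the function `ReadFrom` are hypothetical, and the file is assumed to be ASCII and to only grow between reads. The position is taken after the reader has consumed everything, because a StreamReader buffers ahead and the base stream position is only meaningful once the reader has reached the end.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text

Module IncrementalReader
    ' Hypothetical helper: appends every line from startOffset to the current
    ' end of the file to lines, and returns the offset to resume from next time.
    Function ReadFrom(path As String, startOffset As Long, lines As List(Of String)) As Long
        Using fs As New FileStream(path, FileMode.Open, FileAccess.Read, FileShare.ReadWrite)
            ' Fall back to the start if the file is shorter than the saved offset.
            If startOffset > fs.Length Then startOffset = 0
            fs.Seek(startOffset, SeekOrigin.Begin)
            Using sr As New StreamReader(fs, Encoding.ASCII)
                Dim line As String = sr.ReadLine()
                While line IsNot Nothing
                    lines.Add(line)
                    line = sr.ReadLine()
                End While
                ' The reader hit the end of the stream, so the base stream
                ' position now marks the end of the data that was read.
                Return fs.Position
            End Using
        End Using
    End Function
End Module
```

A caller would keep the returned value, for example `offset = ReadFrom("foo.txt", offset, lines)`, and pass it back in on the next call. If the writer is in the middle of a line when the read happens, that line is returned truncated and its remainder shows up as a separate line on the next call, so this sketch suits files that are written in whole lines.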

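Answer 1 (score: 0)

What worked for me was skipping the first 4M lines (just a simple counter > 4M check wrapped around everything inside the loop), and then adding a background worker that does the filtering and, if a line is important, adds it to the array while the main thread keeps reading lines. This saved about a third of the time.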
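
The answer does not include code, so the following is only a sketch of the general shape, not the answerer's implementation. It swaps the background worker for a `Task` fed through a bounded `BlockingCollection`, which needs .NET 4.5 or later for `Task.Run`. The names `BackgroundFilter`, `FilterInBackground`, `skipLines`, and `keyword` are hypothetical, and "important" is reduced to a plain substring match.

```vbnet
Imports System
Imports System.Collections.Concurrent
Imports System.Collections.Generic
Imports System.IO
Imports System.Threading.Tasks

Module BackgroundFilter
    ' Hypothetical helper: skips the first skipLines lines, then hands each
    ' remaining line to a background task that keeps those containing keyword.
    Function FilterInBackground(path As String, skipLines As Integer, keyword As String) As List(Of String)
        Dim kept As New List(Of String)
        ' Bounded queue so the reader cannot run far ahead of the filter.
        Using queue As New BlockingCollection(Of String)(10000)
            Dim worker As Task = Task.Run(
                Sub()
                    ' Ends once CompleteAdding is called and the queue is drained.
                    For Each candidate As String In queue.GetConsumingEnumerable()
                        If candidate.Contains(keyword) Then kept.Add(candidate)
                    Next
                End Sub)

            Dim count As Integer = 0
            Try
                ' The main thread only reads and counts; skipped lines are
                ' still read from disk, they are just never queued.
                For Each line As String In File.ReadLines(path)
                    count += 1
                    If count > skipLines Then queue.Add(line)
                Next
            Finally
                queue.CompleteAdding()
            End Try
            worker.Wait()
        End Using
        Return kept
    End Function
End Module
```

For the log in the question, a call might look like `FilterInBackground(path, 4000000, "CombatLogNames flags:0x1")`. Only the worker touches `kept` until `worker.Wait()` returns, so the list needs no locking in this arrangement.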