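I have a simple .txt log file to which an application appends lines while it runs. Each line consists of a timestamp followed by variable-length text: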
17-06-25 06:37:43 xxxxxxxxxxxxxxx
17-06-25 06:37:46 yyyyyyy
17-06-25 06:37:50 zzzzzzzzzzzzzzzzzzzzzzzzzzzz
...
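I need to extract all lines whose timestamp is later than a given date and time. Typically these are the last 20 to 40 log entries (lines).
The problem is that the file is large and keeps growing.
If all lines had the same length, I would use a binary search. They don't, so I ended up with something like this: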
Private Sub ExtractNewestLogs(dEarliest As Date)
    Dim sLine As String = ""
    Dim oSRLog As New StreamReader(gsFilLog)
    sLine = oSRLog.ReadLine()
    Do While Not (sLine Is Nothing)
        Debug.Print(sLine)
        sLine = oSRLog.ReadLine()
    Loop
End Sub
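which is, well, not really fast.
Is there a way to read such a file "backwards", i.e. starting from the last line? If not, what other options do I have?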
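Answer 0 (score: 1)
The function below uses a BinaryReader to return the last x bytes of the file as an array of strings, one per line. That lets you extract the last records you need much faster than reading the whole log file. You can tune the number of bytes to read from a rough estimate of how many bytes the last 20-40 log entries occupy. On my PC, reading the last 10,000 characters of a 17 MB text file took less than 10 ms.
Of course, this code assumes that your log file is plain ASCII text.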
'Requires Imports System.IO and Imports System.Text.
Private Function ReadLastbytes(filePath As String, x As Long) As String()
    Dim fileData() As Byte
    'FileShare.ReadWrite lets us read while the application still has the log open for writing.
    Using oFileStream As New FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.ReadWrite)
        'Start x bytes before the end, or at the beginning if the file is shorter than x.
        Dim lStart As Long = Math.Max(0L, oFileStream.Length - x)
        oFileStream.Seek(lStart, SeekOrigin.Begin)
        Using oBinaryReader As New BinaryReader(oFileStream)
            'Read from lStart to the end of the file (at most x bytes).
            fileData = oBinaryReader.ReadBytes(CInt(oFileStream.Length - lStart))
        End Using
    End Using
    'Plain ASCII is assumed. Unless the read started at byte 0, the first
    'element is usually only the tail end of a line.
    Return Encoding.ASCII.GetString(fileData).Split({vbCrLf}, StringSplitOptions.RemoveEmptyEntries)
End Function
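To get from "the last x bytes" to "all lines newer than a given timestamp", the byte count does not have to be guessed exactly. The following is only a sketch under stated assumptions: the names `TailFilter`, `ReadTail` and `LinesSince` and the 4096-byte starting window are made up for illustration, the file is ASCII with CR/LF line ends, and every line starts with a key that sorts like the `yy-MM-dd HH:mm:ss` timestamps in the question. It reads a tail window and doubles it until the window reaches back to a line older than the cutoff (or covers the whole file).

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Text
Imports Microsoft.VisualBasic

Module TailFilter
    'Reads the last `window` bytes of the file as ASCII lines.
    'wholeFile reports whether the window covered the entire file.
    Private Function ReadTail(filePath As String, window As Long,
                              ByRef wholeFile As Boolean) As String()
        Using fs As New FileStream(filePath, FileMode.Open, FileAccess.Read,
                                   FileShare.ReadWrite)
            Dim start As Long = Math.Max(0L, fs.Length - window)
            wholeFile = (start = 0)
            fs.Seek(start, SeekOrigin.Begin)
            Using reader As New BinaryReader(fs)
                Dim bytes As Byte() = reader.ReadBytes(CInt(fs.Length - start))
                Return Encoding.ASCII.GetString(bytes).Split(
                    {vbCrLf}, StringSplitOptions.RemoveEmptyEntries)
            End Using
        End Using
    End Function

    'Returns all lines whose leading key is >= sEarliest (ordinal comparison).
    Public Function LinesSince(filePath As String, sEarliest As String,
                               Optional window As Long = 4096) As List(Of String)
        Do
            Dim wholeFile As Boolean
            Dim lines As String() = ReadTail(filePath, window, wholeFile)
            'Unless the window starts at byte 0, the first line may be cut off.
            Dim first As Integer = If(wholeFile, 0, 1)
            'The window is large enough once its first complete line is older
            'than the cutoff, or once it covers the whole file.
            Dim reachesBack As Boolean = wholeFile OrElse
                (lines.Length > first AndAlso
                 String.CompareOrdinal(lines(first), sEarliest) < 0)
            If reachesBack Then
                Dim result As New List(Of String)
                For i As Integer = first To lines.Length - 1
                    If String.CompareOrdinal(lines(i), sEarliest) >= 0 Then
                        result.Add(lines(i))
                    End If
                Next
                Return result
            End If
            window *= 2   'Window too small: look further back.
        Loop
    End Function
End Module
```

A call such as `TailFilter.LinesSince(gsFilLog, "17-06-25 06:37:46")` would then return the matching lines. Comparing the whole line against the key works here because a line that starts with the key but is longer sorts after it in an ordinal comparison.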
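Answer 1 (score: 0)
I tried a binary search anyway, even though the file has no fixed line length.
First some considerations, then the code:
Sometimes one needs to extract the last n lines of a log file, selected by an ascending sort key at the beginning of each line. The key can be anything, but in log files it usually represents a date and time, typically in YYMMDDHHNNSS format (possibly with some punctuation in between).
Log files are usually text files consisting of many lines, sometimes millions. Often a log file has a fixed line width, in which case a specific key is easy to reach with a binary search. Probably just as often, however, log files have variable line widths. To access these, one can use an estimate of the average line width to compute a file position and then process sequentially from there to EOF.
A binary approach can be used for this type of file as well, as shown here. Its advantage shows as soon as file sizes grow. The maximum size of a log file is determined by the file system: in theory NTFS allows 16 EiB (16 x 2^60 B); in practice, under Windows 8 or Server 2012, the limit is 256 TiB (256 x 2^40 B).
(What 256 TiB means in practice: a typical log file is designed to be human-readable and rarely exceeds 80 characters per line. Suppose your log file has been logging happily and without any interruption for an astounding 12 years, a total of 4,383 days of 86,400 seconds each. Your application could then write 9 entries per millisecond into that log file and would only hit the 256 TiB limit in its 13th year.)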
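The 13th-year figure can be checked with a few lines of arithmetic. This is only an illustration: the module and function names are invented here, and the 80 bytes per line and 9 lines per millisecond are the assumptions stated in the paragraph above.

```vbnet
Imports System

Module LogSizeEstimate
    'Assumed figures from the text: 80-byte lines, 9 lines per millisecond.
    Function BytesPerDay() As Long
        Return 80L * 9L * 1000L * 86400L
    End Function

    Sub Main()
        Dim limit As Long = 256L << 40                      '256 TiB in bytes
        Console.WriteLine(4383L * BytesPerDay())            'Bytes written after 12 years.
        Console.WriteLine(limit)                            'The 256 TiB limit for comparison.
        Console.WriteLine(limit / BytesPerDay() / 365.25)   'Years until the limit is reached.
    End Sub
End Module
```

After 12 years the file is still slightly below the limit, and the last line comes out at a little over 12 years, which matches the claim that the limit is reached during the 13th year.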
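The great advantage of the binary approach is that n comparisons suffice for a log file of 2^n bytes, so the benefit grows quickly with file size: a 1 KiB file needs 10 comparisons (one per 102.4 B), a 1 MiB file needs 20 (one per roughly 51 KiB), a 1 GiB file needs 30 (one per roughly 34 MiB), and a 1 TiB file needs only 40 (one per roughly 25.6 GiB).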
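As a quick check of these numbers, the following sketch counts how often a range of a given size can be halved before a single byte is left. The module and function names are hypothetical; it only illustrates the log2 relationship and is not part of the extraction function.

```vbnet
Imports System

Module ComparisonCount
    'Number of halvings needed to narrow sizeInBytes down to one byte.
    Function Comparisons(sizeInBytes As Long) As Integer
        Dim n As Integer = 0
        Dim remaining As Long = sizeInBytes
        Do While remaining > 1
            remaining = (remaining + 1) \ 2   'Round up so odd sizes are covered.
            n += 1
        Loop
        Return n
    End Function

    Sub Main()
        'Sizes of 1 KiB, 1 MiB, 1 GiB and 1 TiB.
        For Each size As Long In {1L << 10, 1L << 20, 1L << 30, 1L << 40}
            Console.WriteLine("{0} B: {1} comparisons", size, Comparisons(size))
        Next
    End Sub
End Module
```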
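On to the function. The following assumptions are made: the log file is UTF-8 encoded, log lines are separated by a CR/LF sequence, and the timestamp sits at the beginning of each line in ascending order, probably in [YY]YYMMDDHHNNSS format, possibly with some punctuation in between. (All of these assumptions can easily be modified and handled by overloaded function calls.)
In the outer loop, the binary narrowing is done by comparing against the provided earliest date-time to match. Once a new position in the stream has been found by bisection, an independent forward search in the inner loop locates the next CR/LF sequence. The byte after this sequence marks the start of the record key to be compared. If this key is greater than or equal to the one we are searching for, it is ignored. Only if the key found is smaller than the search key is its position remembered as a possible candidate for the record just before the ones we want. We end up with the last record whose key is the largest one that is still smaller than the search key.
Finally, all log records after the final candidate are returned to the caller as a string array.
The function requires importing System.IO.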
Imports System.IO
Imports System.Linq 'For SequenceEqual (imported by default in many VB projects).

'This function expects a log file which is organized in lines of varying
'lengths, delimited by CR/LF. At the start of each line is a sort criterion
'of any kind (in log files typically YYMMDD HHMMSS), by which the lines are
'sorted in ascending order (newest log line at the end of the file). The
'earliest match allowed to be returned must be provided. From this the sort
'key's length is inferred. The key does not necessarily need to exist. If it
'does, it can occur multiple times, like all other sort keys. The returned
'string array contains all lines that follow the last one found to be
'smaller than the provided sort key.
Public Shared Function ExtractLogLines(sLogFile As String,
                                       sEarliest As String) As String()
    'FileShare.ReadWrite allows reading while the application still has the
    'log file open for writing.
    Dim oFS As New FileStream(sLogFile, FileMode.Open, FileAccess.Read,
                              FileShare.ReadWrite) 'The log file as file stream.
    Dim lMin, lPos, lMax As Long 'Examined stream window.
    Dim i As Long 'Iterator to find CR/LF.
    Dim abEOL(0 To 1) As Byte 'Bytes to find CR/LF.
    Dim abCRLF() As Byte = {13, 10} 'Search for CR/LF.
    Dim bFound As Boolean 'CR/LF found.
    Dim iKeyLen As Integer = sEarliest.Length 'Length of sort key.
    Dim sActKey As String 'Key of examined log record.
    Dim abKey() As Byte 'Reading the current key.
    Dim lCandidate As Long 'File position of promising candidate.
    Dim sRecords As String 'All wanted records.

    'The byte array accepting the records' keys is as long as the provided
    'key.
    ReDim abKey(0 To iKeyLen - 1) '0-based!

    'We search the last log line whose sort key is smaller than the sort key
    'provided in sEarliest.
    lMin = 0 'Start at stream start
    lMax = oFS.Length - 1 - 2 '0-based, and without terminal CRLF.
    Do
        lPos = (lMax - lMin) \ 2 + lMin 'Position to examine now.

        'Although the key to be compared with sEarliest is located after
        'lPos, it is important that lPos itself is not modified when
        'searching for the key.
        i = lPos 'Iterator for the CR/LF search.
        bFound = False
        Do While i < lMax
            oFS.Seek(i, SeekOrigin.Begin)
            oFS.Read(abEOL, 0, 2)
            If abEOL.SequenceEqual(abCRLF) Then 'CR/LF found.
                bFound = True
                Exit Do
            End If
            i += 1
        Loop
        If Not bFound Then
            'Between lPos and lMax no more CR/LF could be found. This means
            'that the search is over.
            Exit Do
        End If
        i += 2 'Skip CR/LF.
        oFS.Seek(i, SeekOrigin.Begin) 'Read the key after the CR/LF
        oFS.Read(abKey, 0, iKeyLen) 'into a string.
        sActKey = System.Text.Encoding.UTF8.GetString(abKey)

        'Compare the actual key with the earliest key. We want to find the
        'largest key just before the earliest key.
        If sActKey >= sEarliest Then
            'Not interested in this one, look for an earlier key.
            lMax = lPos
        Else
            'Possibly interesting, remember this.
            lCandidate = i
            lMin = lPos
        End If
    Loop While lMin < lMax - 1

    'lCandidate is the position of the first record to be taken into account.
    'Note that we need the final CR/LF here, so that the search for the
    'next CR/LF sequence following below will match a valid first entry even
    'in case there are no entries to be returned (sEarliest being larger than
    'the last log line).
    'Caveat: if no smaller key was found, lCandidate is still 0, so the very
    'first line of the file is treated as the candidate and is dropped below
    'without having been compared.
    ReDim abKey(CInt(oFS.Length - lCandidate - 1)) '0-based.
    oFS.Seek(lCandidate, SeekOrigin.Begin)
    oFS.Read(abKey, 0, CInt(oFS.Length - lCandidate))

    'We're done with the stream.
    oFS.Close()

    'Convert into a string, but omit the first line, then return as a
    'string array split at CR/LF, without the empty last entry.
    sRecords = (System.Text.Encoding.UTF8.GetString(abKey))
    sRecords = sRecords.Substring(sRecords.IndexOf(Chr(10)) + 1)
    Return sRecords.Split(ControlChars.CrLf.ToCharArray(),
                          StringSplitOptions.RemoveEmptyEntries)
End Function
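Since the question passes the cutoff as a Date while the function takes the key as a string, the Date has to be formatted exactly like the timestamps in the file. A minimal sketch, assuming the `yy-MM-dd HH:mm:ss` layout from the question's sample lines; `KeyFormatting` and `ToLogKey` are hypothetical names, not part of the function above:

```vbnet
Imports System
Imports System.Globalization

Module KeyFormatting
    'Formats a Date as the sort key used in the question's log ("yy-MM-dd HH:mm:ss").
    'InvariantCulture keeps the separators independent of the machine's locale.
    Function ToLogKey(d As Date) As String
        Return d.ToString("yy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture)
    End Function

    Sub Main()
        Dim sKey As String = ToLogKey(New Date(2017, 6, 25, 6, 37, 46))
        Console.WriteLine(sKey)   'Prints 17-06-25 06:37:46
        'The key can then be passed on, for example:
        'Dim asLines As String() = ExtractLogLines(gsFilLog, sKey)
    End Sub
End Module
```

With a two-digit year the keys only sort correctly within one century; a log that uses four-digit years would need "yyyy" in the format string instead.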