How can I find the n-th occurrence of a substring, starting from a given offset in a text file?

Time: 2018-02-26 13:28:26

Tags: vb.net search full-text-search indexof

Assume an unformatted raw text file with variable-length lines, delimited by CR/LF sequences. The FileStream object is named oFS.

I want to extract part of it as a String array, starting at the position held in a Long named lCandidate. lCandidate is guaranteed to hold a value in the range 0 to oFS.Length-1.

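Optionally, a Long lMaxLines may contain a number greater than or equal to 0. If it is 0, all lines from position lCandidate until EOF need to be returned. Otherwise, only the text between lCandidate and the n-th occurrence of a CR/LF sequence following lCandidate needs to be returned.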

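With lMaxLines = 0 this is not a problem. With lMaxLines > 0, I could write an IndexOf loop that decrements lMaxLines until it reaches 0, at which point I have found the end position. However, this number can be very large (e.g. 6 digits), so this search would be the bottleneck of the function (lCandidate itself is located with a binary search).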
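
For reference, the IndexOf loop described above might look roughly like the following. This is only a minimal sketch, assuming the text is already available as a String; EndOfNthCrLf and its parameter names are hypothetical and not part of the original function.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module IndexOfLoopSketch
    'Hypothetical helper: returns the index just past the n-th CR/LF found
    'at or after iStart, or -1 if sText holds fewer than n such sequences.
    Function EndOfNthCrLf(sText As String, iStart As Integer,
        n As Integer) As Integer

        Dim iPos As Integer = iStart
        While n > 0
            Dim iHit As Integer = sText.IndexOf(vbCrLf, iPos,
                StringComparison.Ordinal)
            If iHit < 0 Then Return -1
            iPos = iHit + 2     'Continue after the CR/LF just found.
            n -= 1
        End While
        Return iPos
    End Function

    Sub Main()
        Dim sText As String = "aa" & vbCrLf & "bbb" & vbCrLf & "c" & vbCrLf
        Console.WriteLine(EndOfNthCrLf(sText, 0, 2))    'Prints 9.
    End Sub
End Module
```

Each call resumes where the previous match ended, so the text is scanned only once overall; the cost lies in the number of calls and in having to decode the bytes into a String first.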

My question is: is there a more direct way to find the position of the n-th occurrence of a CR/LF sequence after a certain offset?

Imports System.IO
Imports System.Text.Encoding

Private Shared Function pExtractLogLines(sLogFile As String,
    sEarliest As String, lMaxLines As Long, lKeyStart As Long) As String()

    Dim oFS As New FileStream(sLogFile, FileMode.Open, FileAccess.Read,
        FileShare.Read)             'The log file as file stream.
    Dim abKey() As Byte             'Reading the current key.
    Dim lCandidate As Long          'File position of promising candidate.
    Dim sRecords As String          'All wanted records.

    ...

    'lCandidate points to a position before the first record we are 
    'interested in, which commences after the next CR/LF sequence.
    'Note, that we need the final CR/LF here, so that the search for the next
    'CR/LF sequence following below will match a valid first entry even in 
    'case there are no entries to be returned (sEarliest being larger than 
    'the last log line). 
    oFS.Seek(lCandidate, SeekOrigin.Begin)      'Position the stream.
    If lMaxLines = 0 Then
        'There is no limit in the number of records to be returned. Return 
        'all records until EOF.
        ReDim abKey(CInt(oFS.Length - lCandidate - 1))      '0-based.
        oFS.Read(abKey, 0, CInt(oFS.Length - lCandidate))
    Else
        'Only a maximum of lMaxLines variable length records need to be 
        'returned.

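        'Is there a faster way to access thousands of variable-length records,
        'delimited by CR/LF sequences, than a loop that executes thousands of
        'IndexOf statements?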

    End If

    'We're done with the stream.
    oFS.Close()

    'Convert into a string, but omit the first (partial) line, then return
    'as a string array split at CR/LF, without the empty last entry.
    sRecords = UTF8.GetString(abKey)
    sRecords = sRecords.Substring(sRecords.IndexOf(Chr(10)) + 1)

    Return sRecords.Split(ControlChars.CrLf.ToCharArray(),
        StringSplitOptions.RemoveEmptyEntries)
End Function
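
Without a prebuilt index of line starts, some form of linear scan seems unavoidable for variable-length records. One possible direction is to count the line-feed bytes directly on the stream in large chunks, before any decoding to a String. The sketch below assumes UTF-8 content (as the UTF8.GetString call above suggests), where the byte value 10 only ever occurs as LF, and assumes n is at least 1. PositionAfterNthLf, its parameters, and the 64 KiB buffer size are hypothetical choices, not part of the original function.

```vbnet
Imports System
Imports System.IO
Imports Microsoft.VisualBasic

Module LineEndScanSketch
    'Hypothetical helper: scans forward from lStart and returns the stream
    'position just past the n-th LF byte, or the position of EOF if fewer
    'than n LF bytes follow lStart. Assumes n >= 1.
    Function PositionAfterNthLf(oStream As Stream, lStart As Long,
        n As Long) As Long

        Dim abBuf(65535) As Byte        '64 KiB chunk; the size is a guess.
        Dim lPos As Long = lStart       'Stream position of abBuf(0).
        oStream.Seek(lStart, SeekOrigin.Begin)
        Do
            Dim iRead As Integer = oStream.Read(abBuf, 0, abBuf.Length)
            If iRead <= 0 Then Return lPos      'EOF before the n-th LF.
            For i As Integer = 0 To iRead - 1
                If abBuf(i) = 10 Then           'LF terminates a CR/LF pair.
                    n -= 1
                    If n = 0 Then Return lPos + i + 1
                End If
            Next
            lPos += iRead
        Loop
    End Function

    Sub Main()
        Dim abData As Byte() = System.Text.Encoding.UTF8.GetBytes(
            "aa" & vbCrLf & "bbb" & vbCrLf & "c" & vbCrLf)
        Using oMem As New MemoryStream(abData)
            Console.WriteLine(PositionAfterNthLf(oMem, 0, 2))   'Prints 9.
        End Using
    End Sub
End Module
```

With the end position known, the Else branch could seek back to lCandidate and read only the bytes up to that position into abKey. Whether this is actually faster than the IndexOf loop would need to be measured on real log files.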

0 answers:

No answers