Speeding up a parser function for large string data

Time: 2013-10-18 04:16:02

Tags: vb.net

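I currently have a file containing 1 million characters; the file size is 1 MB. I am trying to parse the data with this old function, which still works but is very slow.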

start0end
start1end
start2end
start3end
start4end
start5end
start6end

The code takes about 5 minutes to process the whole data set. Any pointers and suggestions are appreciated.

Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
    Dim sFinal = ""
    Dim strData = textbox.Text
    Dim strFirst = "start"
    Dim strSec = "end"

    Dim strID As String, Pos1 As Long, Pos2 As Long, strCur As String = ""

    Do While InStr(strData, strFirst) > 0
        Pos1 = InStr(strData, strFirst)
        strID = Mid(strData, Pos1 + Len(strFirst))
        Pos2 = InStr(strID, strSec)

        If Pos2 > 0 Then
            strID = Microsoft.VisualBasic.Left(strID, Pos2 - 1)
        End If

        If strID <> strCur Then
            strCur = strID

            sFinal += strID & ","
        End If

        strData = Mid(strData, Pos1 + Len(strFirst) + 3 + Len(strID))
    Loop
End Sub

1 Answer:

Answer 0: (score: 2)

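The reason this is so slow is that you keep destroying and recreating a 1 MB string. Strings are immutable, so strData = Mid(strData... creates a new string and copies the remaining (roughly 1 MB) string data into the new strData variable over and over. Interestingly, even VB6 allowed a progressive index, i.e. a start position for the search.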
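To illustrate the "progressive index" idea in the question's own style, here is a minimal sketch that uses the optional start argument of InStr instead of repeatedly trimming the string. The module name ProgressiveIndexSketch and the function ExtractIds are made up for this illustration, and the sketch does not reproduce the question's check that skips consecutive duplicate IDs.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module ProgressiveIndexSketch
    ' Hypothetical helper: collects the text between each startTag/endTag
    ' pair. The source string is never shortened or copied; only the
    ' search position moves forward.
    Function ExtractIds(data As String, startTag As String, endTag As String) As List(Of String)
        Dim ids As New List(Of String)
        Dim pos As Integer = 1                          ' InStr positions are 1-based

        Do
            Dim p1 As Integer = InStr(pos, data, startTag)
            If p1 = 0 Then Exit Do                      ' no further start marker
            Dim idStart As Integer = p1 + startTag.Length
            Dim p2 As Integer = InStr(idStart, data, endTag)
            If p2 = 0 Then Exit Do                      ' unterminated record
            ids.Add(Mid(data, idStart, p2 - idStart))
            pos = p2 + endTag.Length                    ' resume after the end marker
        Loop

        Return ids
    End Function

    Sub Main()
        Dim sample As String = "start0end" & vbCrLf & "start1end"
        Console.WriteLine(String.Join(",", ExtractIds(sample, "start", "end")))   ' prints 0,1
    End Sub
End Module
```

Each record is copied exactly once (the extracted ID), so the work grows roughly linearly with the input size rather than with the square of it.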

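I would process the disk file LINE BY LINE and extract the information as each line is read (see StreamReader.ReadLine), which avoids working with a 1 MB string at all. Much the same approach can be used there.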

' requires "Imports System.Text" at the top of the file (for StringBuilder)

' 1 MB textbox data (!?)
Dim sData As String = TextBox1.Text
' start/stop - probably fake
Dim sStart As String = "start"
Dim sStop As String = "end"

' result
Dim sbResult As New StringBuilder
' progressive index: where the next search begins
Dim nNDX As Integer = 0

' shortcut at least as far as typing and readability
Dim MagicNumber As Integer = sStart.Length
' NEXT index of start/stop after nNDX
Dim i As Integer = 0
Dim j As Integer = 0

' loop as long as string remains
Do While nNDX < sData.Length
    i = sData.IndexOf(sStart, nNDX)                 ' start index
    If i < 0 Then Exit Do                           ' no more start markers
    j = sData.IndexOf(sStop, i + MagicNumber)       ' stop index
    If j < 0 Then Exit Do                           ' start marker without a stop marker

    ' Extract and append bracketed substring
    sbResult.Append(sData.Substring(i + MagicNumber, j - (i + MagicNumber)))
    ' add a cute comma
    sbResult.Append(",")

    nNDX = j + sStop.Length                         ' where we start next time
Loop

' remove last comma (only if something was found)
If sbResult.Length > 0 Then sbResult.Remove(sbResult.Length - 1, 1)

' show my work
Console.WriteLine(sbResult.ToString)
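
The line-by-line suggestion from earlier in this answer could look roughly like the following. This is only a sketch under assumptions: it assumes each record sits on its own line, as in the sample data, and that at most one start/end pair occurs per line. The module name LineByLineSketch, the function ParseFile, and the filePath parameter are hypothetical names chosen for the illustration.

```vbnet
Imports System
Imports System.IO
Imports System.Text

Module LineByLineSketch
    ' Hypothetical helper: reads the file one line at a time and collects
    ' the text between startTag and endTag on each line, comma-separated.
    Function ParseFile(filePath As String, startTag As String, endTag As String) As String
        Dim result As New StringBuilder()

        Using reader As New StreamReader(filePath)
            Dim line As String = reader.ReadLine()
            Do While line IsNot Nothing
                Dim i As Integer = line.IndexOf(startTag, StringComparison.Ordinal)
                If i >= 0 Then
                    Dim idStart As Integer = i + startTag.Length
                    Dim j As Integer = line.IndexOf(endTag, idStart, StringComparison.Ordinal)
                    If j >= 0 Then
                        ' separate entries with a comma, but never add a trailing one
                        If result.Length > 0 Then result.Append(","c)
                        result.Append(line, idStart, j - idStart)
                    End If
                End If
                line = reader.ReadLine()                ' Nothing at end of file
            Loop
        End Using

        Return result.ToString()
    End Function
End Module
```

Because only one short line is held in memory at a time, the 1 MB string never has to exist. If records can span lines or repeat within a line, the index-based loop above applied to the whole text is the safer fit.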

Edit: a small mock-up for ad hoc test data