Implementing asynchronous search of a text file

Time: 2015-01-09 22:46:08

Tags: vb.net search asynchronous task-parallel-library tpl-dataflow

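I'm creating a Windows Forms application that lets a user specify a text file as a data source, dynamically creates form controls based on the number of columns in the file, and lets the user enter search parameters that are used to search the file when a search button is clicked. Any results are written to a new text file.

The files this program will search are often very large (up to 12 GB). My current search method (read a line, search it, add it to the results file if it is a hit) works well for reasonably sized files (a few MB or so). With my "large" test file (~2.5 GB), searching the file takes about 12 minutes.

So my question is: what is the best way to improve performance? After a lot of searching and reading, I understand that I have the following options:

  • Async methods
  • Tasks
  • TPL Dataflow
  • Some combination of these methods

Since my program logic is more like a stream, I'm leaning towards Dataflow, but I'm not sure how to implement it properly or whether there is a better solution. Below is the code for the search button's click event and the search-related functions.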

'Searches the loaded file
    Private Sub searchBtn_Click(sender As Object, e As EventArgs) Handles searchBtn.Click
        Dim strFileName As String
        Dim didWork As Integer
        Dim searchHits As Integer
        Dim watch As Stopwatch = Stopwatch.StartNew()

        'Prompts user to enter title of file to be created
        exportFD.Title = "Save as. . ."
        exportFD.Filter = "Text Files(*.txt)|*.txt" 'Limits user to only saving as .txt file
        didWork = exportFD.ShowDialog() 'Capture the dialog result so the Cancel check below works

        If didWork = DialogResult.Cancel Then 'Handles if Cancel Button is clicked
            Return
        Else
            strFileName = exportFD.FileName
            Dim writer As New IO.StreamWriter(strFileName, False) 
            Dim reader As New IO.StreamReader(filepath)
            Dim currentLine As String

            'Skip first line of SOURCE text file for search, but use it to write column headers to file
            currentLine = reader.ReadLine()
            Dim columnLine = currentLine.Split(vbTab)

            'First: Insert column names into NEW text file
            For col As Integer = 0 To colCount - 1
                writer.Write(columnLine(col) & vbTab)
            Next
            writer.Write(vbNewLine)

            'Search whole file, line by line
            Do While reader.Peek() >= 0 'Peek returns -1 at end of stream
                'next line
                currentLine = reader.ReadLine()

                'new function:
                If validChromosome(currentLine) Then
                    writer.WriteLine(currentLine)
                    searchHits += 1
                End If
            Loop

            'Close out writer and reader and tell user file was saved
            writer.Close()
            reader.Close()
            searchTxtB.Text = searchHits.ToString()
            watch.Stop()
            MsgBox("Searched in: " + watch.Elapsed.ToString() + " and saved to: " + strFileName)
        End If

    End Sub

    'This function searches through the current line and checks if it follows what the user has searched for
    Private Function validChromosome(chromString As String) As Boolean

        'Split line by delimiter
        Dim readRow() As String = Split(chromString, vbTab)
        validChromosome = True 'Start off as true

        Dim rowLength As Integer = readRow.Length - 1

        'Iterate through string tokens and compare 
        For token As Integer = 0 To rowLength
            Try
                Dim currentGroupBox As GroupBox = criteriaPanel.Controls.Item(token)
                Dim checkedParameter As CheckBox = currentGroupBox.Controls("CheckBox")

                'User wants to search this parameter
                If checkedParameter.Checked = True Then
                    Dim numericRadio As RadioButton = currentGroupBox.Controls("NumericRadio")

                    'Searching by number
                    If numericRadio.Checked = True Then
                        Dim value As Decimal
                        Dim lowerBox As NumericUpDown = currentGroupBox.Controls("NumericBoxLower")
                        Dim upperBox As NumericUpDown = currentGroupBox.Controls("NumericBoxUpper")

                        Dim lowerInclusiveCheck As CheckBox = currentGroupBox.Controls("NumericInclusiveLowerCheckBox")
                        Dim upperInclusiveCheck As CheckBox = currentGroupBox.Controls("NumericInclusiveUpperCheckBox")

                        'Try to convert the text to a decimal. 
                        If Not Decimal.TryParse(readRow(token), value) Then
                            validChromosome = False
                            Exit For
                        End If

                        'Not within the given range user inputted for numeric search
                        If Not withinRange(value, lowerBox.Value, upperBox.Value, lowerInclusiveCheck.Checked, upperInclusiveCheck.Checked) Then
                            validChromosome = False
                            Exit For
                        End If

                    Else 'Searching by text
                        Dim textBox As TextBox = currentGroupBox.Controls("TextBox")

                        'If the comparison failed, then this chromosome is not valid. Break out of loop and return false.
                        If Not [String].Equals(readRow(token), textBox.Text.ToString(), StringComparison.OrdinalIgnoreCase) Then

                            validChromosome = False
                            Exit For

                        End If
                    End If

                End If


            Catch ex As Exception

                'Simple error checking.
                MsgBox(ex.ToString)
                validChromosome = False
                Exit For

            End Try
        Next

    End Function

    'Function to check if value is safely in between two values
    Private Function withinRange(value As Decimal, lower As Decimal, upper As Decimal, inclusiveLower As Boolean, inclusiveUpper As Boolean) As Boolean
        withinRange = False
        Dim lowerCheck As Boolean = False
        Dim upperCheck As Boolean = False

        If inclusiveLower Then
            lowerCheck = value >= lower
        Else
            lowerCheck = value > lower
        End If

        If inclusiveUpper Then
            upperCheck = value <= upper
        Else
            upperCheck = value < upper
        End If

        withinRange = lowerCheck And upperCheck

    End Function

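My current theory is that I should create a TransformBlock that contains my file-reading method and builds small buffers (~10 lines). These buffers would be passed to another TransformBlock that searches them and puts the results in a list, which is then passed to another TransformBlock that writes to the export file.

My search function (validChromosome) is most likely not very good, so any suggestions for improving it would also be welcome. This is my first program, and I know VB.NET may not be the best language for text file manipulation, but I'm forced to use it. Thanks in advance for any help, and let me know if more information is needed.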

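1 answer:

Answer 0 (score: 0)

TPL Dataflow seems like a very good fit, especially since it easily supports async.

I would keep the reading sequential, since hard drives mostly don't perform well with concurrent reads. There is therefore no need for a block for reading: just read buffers in a while loop and post them to a TDF block. Then you can have a TransformBlock that searches each buffer and passes the result to the next block, which saves it to the file.

The TransformBlock can run in parallel, so you should set an appropriate MaxDegreeOfParallelism (probably Environment.ProcessorCount).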
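
As a rough illustration of that layout, here is a minimal sketch rather than a tuned implementation. It assumes the TPL Dataflow NuGet package is referenced (namespace System.Threading.Tasks.Dataflow). The names DataflowSearchSketch, SearchFileAsync, isHit and batchSize are hypothetical, and the batch size and bounded capacities are guesses to be measured. The isHit predicate runs on worker threads, so it must not read WinForms controls the way validChromosome does; copy the search criteria into plain values before starting.

```vbnet
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Threading.Tasks
Imports System.Threading.Tasks.Dataflow

Public Module DataflowSearchSketch

    ' Hypothetical helper: reads sourcePath sequentially, filters batches of lines
    ' in parallel, writes the hits to resultPath, and returns the number of hits.
    Public Async Function SearchFileAsync(sourcePath As String,
                                          resultPath As String,
                                          isHit As Func(Of String, Boolean),
                                          Optional batchSize As Integer = 1000) As Task(Of Integer)
        Dim hits As Integer = 0

        Using reader As New StreamReader(sourcePath),
              writer As New StreamWriter(resultPath, False)

            ' Copy the header line straight through to the result file
            Dim header As String = reader.ReadLine()
            If header Is Nothing Then Return 0
            writer.WriteLine(header)

            ' Search stage: several batches are filtered at once. BoundedCapacity
            ' (an assumed value) keeps the reader from racing ahead of the search.
            Dim filter As Func(Of List(Of String), List(Of String)) =
                Function(batch) batch.Where(isHit).ToList()
            Dim searchBlock As New TransformBlock(Of List(Of String), List(Of String))(
                filter,
                New ExecutionDataflowBlockOptions With {
                    .MaxDegreeOfParallelism = Environment.ProcessorCount,
                    .BoundedCapacity = Environment.ProcessorCount * 4
                })

            ' Write stage: default parallelism of 1, so only one batch is written
            ' at a time and the hit counter needs no locking.
            Dim save As Action(Of List(Of String)) =
                Sub(matches)
                    For Each line As String In matches
                        writer.WriteLine(line)
                    Next
                    hits += matches.Count
                End Sub
            Dim writeBlock As New ActionBlock(Of List(Of String))(
                save,
                New ExecutionDataflowBlockOptions With {
                    .BoundedCapacity = Environment.ProcessorCount * 4
                })

            searchBlock.LinkTo(writeBlock,
                               New DataflowLinkOptions With {.PropagateCompletion = True})

            ' Sequential read loop: fill a buffer, hand it to the search stage
            Dim buffer As New List(Of String)(batchSize)
            Dim current As String = reader.ReadLine()
            While current IsNot Nothing
                buffer.Add(current)
                If buffer.Count = batchSize Then
                    ' SendAsync waits while the search stage is at capacity
                    Await searchBlock.SendAsync(buffer)
                    buffer = New List(Of String)(batchSize)
                End If
                current = reader.ReadLine()
            End While
            If buffer.Count > 0 Then Await searchBlock.SendAsync(buffer)

            ' Signal the end of input and wait until every batch has been written
            searchBlock.Complete()
            Await writeBlock.Completion
        End Using

        Return hits
    End Function
End Module
```

A TransformBlock emits its results in the order the inputs arrived even when it processes them in parallel, so the hits should land in the result file in source order. The read loop itself is synchronous, so from a click handler you would probably want to run the whole call on a background thread to keep the form responsive.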