Fastest way to compare a large amount of text against 125,000 database records

Time: 2016-09-06 13:15:20

Tags: database vb.net for-loop search text

The purpose of my application is to extract text from documents and search it for specific entries that match records in a database.

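  1. My application extracts the text from a document and populates a text box with the extracted text.
  2. Each document can contain anywhere from 200 to 600,000 words (including a lot of ordinary plain text).
  3. The extracted text is compared against specific database entry values, and matches are pushed into an array.
  4. My database contains approximately 125,000 records.
  5. My code loops through the database records and compares each one against the extracted text. If a match is found in the text, it is inserted into an array that I use later.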

    txtBoxExtraction.Text = "A whole load of text goes in here, " & _
           "including the database entries I am trying to match," & _
           "i.e. AX55F8000AFXZ and PP-Q4681TX/AA up to 600,000 words"
    
    Dim dv As New DataView(_DBASE_ConnectionDataSet.Tables(0))
    dv.Sort = "UNIQUEID"
    
    'There are 125,000 entries here in my sorted DataView dv e.g.
    'AX40EH5300
    'GB46ES6500
    'PP-Q4681TX/AA
    
    For i = 0 To maxFileCount
    
        Dim path As String = Filename(i)
    
        If File.Exists(path) Then
            Try
                Using sr As New StreamReader(path)
                    txtBoxExtraction.Text = sr.ReadToEnd()
                End Using
            Catch e As Exception
                Console.WriteLine("The process failed: {0}", e.ToString())
            End Try
        End If
    
        ' Scan the whole extracted text once per database record (the slow part)
        For dvRow As Integer = 0 To dv.Table.Rows.Count - 1
            strUniqueID = dv.Table.Rows(dvRow)("UNIQUEID").ToString()
            If txtBoxExtraction.Text.ToLower().Contains(strUniqueID.ToLower()) Then
                ' Add UniqueID to array and do some other stuff..
            End If
        Next dvRow
    
    Next i
    

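Although the code works, I am looking for a faster way to perform the database matching (the 'For dvRow' loop).

If the document is small, around 200 words, the 'For dvRow...' loop completes quickly, within a few seconds.

If the document contains a large amount of text, 600,000 words or more, it can take several hours or longer to complete.

I found a couple of similar posts, but none were close enough to my problem for me to implement any of the suggestions.

High performance "contains" search in list of strings in C# https://softwareengineering.stackexchange.com/questions/118759/how-to-quickly-search-through-a-very-large-list-of-strings-records-on-a-databa

Any help is greatly appreciated.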

1 Answer:

Answer 0 (score: 1)

Here is an example of what I wrote in my comment.

  

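If this is the actual code, I don't understand why you need to put the information in a text box. You could gain a bit of speed by not displaying the text on screen. If you have 125,000 UNIQUEIDs, it might be better to extract the IDs from your file and then search that list, instead of searching the whole text every time. Even just splitting the text by spaces and keeping only the "words" within a specific size range could make it faster.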

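It seems you want a word-level check rather than a character-level check, and you only want to check those IDs, not every word. So you should extract the candidate IDs from each text before doing any searching. This greatly reduces the amount of searching that has to be done. If the text never changes, you can also save this list of IDs.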

Imports System
Imports System.Collections.Generic
Imports System.Linq

Module Module1

    Private UNIQUEID_MIN_SIZE As Integer = 8
    Private UNIQUEID_MAX_SIZE As Integer = 12

    Sub Main()

        Dim text As String
        Dim startTime As DateTime
        Dim uniqueIds As List(Of String)

        text = GetText()
        uniqueIds = GetUniqueIds()

        '--- Very slow

        startTime = DateTime.Now

        ' Search
        For Each uniqueId As String In uniqueIds
            text.Contains(uniqueId)
        Next

        Console.WriteLine("Took {0}s", DateTime.Now.Subtract(startTime).TotalSeconds)

        '--- Very fast

        startTime = DateTime.Now

        ' Split the text into words on spaces
        Dim words As List(Of String) = text.Split(" "c).ToList()

        ' Collect the distinct words whose length falls within the ID size range
        Dim uniqueIdInText As New Dictionary(Of String, String)

        For Each word As String In words
            ' Keep only words that could be an ID (length within MIN..MAX)
            If word.Length >= UNIQUEID_MIN_SIZE AndAlso word.Length <= UNIQUEID_MAX_SIZE Then
                If Not uniqueIdInText.ContainsKey(word) Then
                    uniqueIdInText.Add(word, "")
                End If
            End If
        Next

        ' Search
        For Each uniqueId As String In uniqueIds
            uniqueIdInText.ContainsKey(uniqueId)
        Next

        Console.WriteLine("Took {0}s", DateTime.Now.Subtract(startTime).TotalSeconds)

        Console.ReadLine()

    End Sub

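    ' Generates a random lowercase word, for testing only.
    ' You can ignore this function.
    Function GetRandomWord(ByVal len As Integer) As String

        ' Static keeps a single Random across calls; creating a new time-seeded
        ' instance on every call can yield the same sequence repeatedly.
        Static rnd As New Random()
        Dim builder As New System.Text.StringBuilder
        Dim alphabet As String = "abcdefghijklmnopqrstuvwxyz"

        For i As Integer = 0 To len - 1
            ' The upper bound of Next is exclusive, so use alphabet.Length to include "z"
            builder.Append(alphabet.Substring(rnd.Next(0, alphabet.Length), 1))
        Next

        Return builder.ToString()
    End Function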

    Function GetText() As String

        Dim builder As New System.Text.StringBuilder
        Dim rnd As New Random()

        For i As Integer = 0 To 600000
            builder.Append(GetRandomWord(rnd.Next(1, 15)))
            builder.Append(" ")
        Next

        Return builder.ToString()
    End Function

    Function GetUniqueIds() As List(Of String)

        Dim ids As New List(Of String)
        Dim rnd As New Random()

        For i As Integer = 1 To 125000
            ' The upper bound of Next is exclusive, so add 1 to allow IDs of the maximum size
            ids.Add(GetRandomWord(rnd.Next(UNIQUEID_MIN_SIZE, UNIQUEID_MAX_SIZE + 1)))
        Next

        Return ids
    End Function

End Module
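
To get closer to the code in the question, which compares case-insensitively and needs the matched IDs collected, the same idea can be sketched with a HashSet. This is only a rough sketch under assumptions: FindMatchingIds is a hypothetical helper name, IDs are assumed to appear as whole tokens separated by whitespace, and the listed punctuation characters are assumed never to start or end an ID.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module MatchSketch

    ' Hypothetical helper (not part of the code above): returns the tokens in
    ' text that match one of knownIds, ignoring case, without duplicates.
    Function FindMatchingIds(ByVal text As String, ByVal knownIds As IEnumerable(Of String)) As List(Of String)

        ' Case-insensitive set, so no ToLower() call is needed per comparison.
        Dim idSet As New HashSet(Of String)(knownIds, StringComparer.OrdinalIgnoreCase)
        Dim found As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)
        Dim matches As New List(Of String)

        ' Assumption: tokens are separated by spaces, tabs, or line breaks.
        Dim separators As Char() = {" "c, ControlChars.Tab, ControlChars.Cr, ControlChars.Lf}
        ' Assumption: these characters may surround an ID in prose but never start or end one.
        Dim punctuation As Char() = {","c, "."c, ";"c, ":"c, "("c, ")"c}

        For Each token As String In text.Split(separators, StringSplitOptions.RemoveEmptyEntries)
            Dim candidate As String = token.Trim(punctuation)
            ' One hash lookup per token instead of one full text scan per ID.
            If idSet.Contains(candidate) AndAlso found.Add(candidate) Then
                matches.Add(candidate)
            End If
        Next

        Return matches
    End Function

    Sub Main()
        Dim ids As String() = {"AX55F8000AFXZ", "PP-Q4681TX/AA", "GB46ES6500"}
        Dim sample As String = "Some text, e.g. ax55f8000afxz and PP-Q4681TX/AA, up to 600,000 words"

        For Each id As String In FindMatchingIds(sample, ids)
            Console.WriteLine(id)
        Next
    End Sub

End Module
```

Note that this returns the tokens as they are written in the text, and that, unlike Contains, it will not find an ID embedded inside a longer token. If your documents can contain IDs glued to other characters, you would need to adjust the separators or the trimming to suit your data.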