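The purpose of my application is to extract text from documents and search it for specific entries that match records in a database.
My code loops through the database records and compares each one against the extracted text. If a match is found in the text, it is inserted into an array that I use later.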
txtBoxExtraction.Text = "A whole load of text goes in here, " & _
"including the database entries I am trying to match," & _
"i.e. AX55F8000AFXZ and PP-Q4681TX/AA up to 600,000 words"
Dim dv As New DataView(_DBASE_ConnectionDataSet.Tables(0))
dv.Sort = "UNIQUEID"
'There are 125,000 entries here in my sorted DataView dv e.g.
'AX40EH5300
'GB46ES6500
'PP-Q4681TX/AA
For i = 0 To maxFileCount
    Dim path As String = Filename(i)
    If File.Exists(path) Then
        Try
            Using sr As New StreamReader(path)
                txtBoxExtraction.Text = sr.ReadToEnd()
            End Using
        Catch e As Exception
            Console.WriteLine("The process failed: {0}", e.ToString())
        End Try
    End If
    For dvRow As Integer = 0 To dv.Table.Rows.Count - 1
        strUniqueID = dv.Table.Rows(dvRow)("UNIQUEID").ToString()
        If txtBoxExtraction.Text.ToLower().Contains(strUniqueID.ToLower()) Then
            ' Add UniqueID to array and do some other stuff..
        End If
    Next dvRow
Next i
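Although the code works, I am looking for a faster way to perform the database matching (the 'For dvRow' loop).
If the document is small, around 200 words, the 'For dvRow' loop completes within a few seconds.
If the document contains a lot of text, 600,000 words or more, it can take several hours or longer to complete.
I found a few similar posts, but none were close enough to my problem for me to implement any of their suggestions.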
High performance "contains" search in list of strings in C# https://softwareengineering.stackexchange.com/questions/118759/how-to-quickly-search-through-a-very-large-list-of-strings-records-on-a-databa
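Any help is greatly appreciated.
Answer 0 (score: 1)
This is an example to go with the comment I wrote:
If that is the actual code, I don't see why you need to put the information in a textbox. You could gain a little speed by not displaying the text on screen. If you have 125,000 UNIQUEIDs, it might be better to extract the IDs from your file and then search that list, instead of searching the whole text every time. Even just splitting the text on spaces and keeping only the "words" within a specific length range could make it faster.
It seems you want a word-level check rather than a character-level one, and you only care about those IDs rather than every word. So you should extract the candidate IDs from each text before doing any searching. That greatly reduces the amount of searching required. If the text never changes, you could also save this list of IDs.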
Imports System
Imports System.Collections.Generic

Module Module1

    Private UNIQUEID_MIN_SIZE As Integer = 8
    Private UNIQUEID_MAX_SIZE As Integer = 12

    ' One shared generator. On .NET Framework, Random instances created in
    ' quick succession can share a seed and repeat the same "random" words.
    Private ReadOnly Rng As New Random()

    Sub Main()
        Dim text As String
        Dim startTime As DateTime
        Dim uniqueIds As List(Of String)

        text = GetText()
        uniqueIds = GetUniqueIds()

        '--- Very slow
        startTime = DateTime.Now

        ' Search: scans the whole text once per ID
        For Each uniqueId As String In uniqueIds
            text.Contains(uniqueId)
        Next

        Console.WriteLine("Took {0}s", DateTime.Now.Subtract(startTime).TotalSeconds)

        '--- Very fast
        startTime = DateTime.Now

        ' Split the text into words
        Dim words As String() = text.Split(" "c)

        ' Collect the distinct words that could be IDs, assuming IDs fall within a known length range
        Dim uniqueIdInText As New HashSet(Of String)
        For Each word As String In words
            If word.Length >= UNIQUEID_MIN_SIZE AndAlso word.Length <= UNIQUEID_MAX_SIZE Then
                uniqueIdInText.Add(word)
            End If
        Next

        ' Search: one hash lookup per ID
        For Each uniqueId As String In uniqueIds
            uniqueIdInText.Contains(uniqueId)
        Next

        Console.WriteLine("Took {0}s", DateTime.Now.Subtract(startTime).TotalSeconds)

        Console.ReadLine()
    End Sub

    ' The functions below only generate random words for testing.
    ' You can ignore them.
    Function GetRandomWord(ByVal len As Integer) As String
        Dim builder As New System.Text.StringBuilder
        Dim alphabet As String = "abcdefghijklmnopqrstuvwxyz"

        For i As Integer = 0 To len - 1
            ' The upper bound of Next is exclusive, so alphabet.Length keeps "z" reachable
            builder.Append(alphabet(Rng.Next(0, alphabet.Length)))
        Next

        Return builder.ToString()
    End Function

    Function GetText() As String
        Dim builder As New System.Text.StringBuilder

        For i As Integer = 1 To 600000
            builder.Append(GetRandomWord(Rng.Next(1, 15)))
            builder.Append(" ")
        Next

        Return builder.ToString()
    End Function

    Function GetUniqueIds() As List(Of String)
        Dim ids As New List(Of String)

        For i As Integer = 1 To 125000
            ' + 1 because the upper bound of Next is exclusive
            ids.Add(GetRandomWord(Rng.Next(UNIQUEID_MIN_SIZE, UNIQUEID_MAX_SIZE + 1)))
        Next

        Return ids
    End Function

End Module
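To apply the same idea to the loop in the question, something along these lines might work. It is only a sketch under a few assumptions: `FindMatchingIds` and `TrimChars` are hypothetical names, IDs are assumed to appear in the text as whole whitespace-separated tokens (unlike `Contains`, this will not find an ID embedded inside a longer word), and matching is assumed to be case-insensitive, as the `ToLower` calls in the question suggest.

```vbnet
Imports System
Imports System.Collections.Generic

Module IdMatcher

    ' Hypothetical: punctuation stripped from both ends of each token.
    ' "-" and "/" are left out on purpose because IDs such as PP-Q4681TX/AA contain them.
    Private ReadOnly TrimChars As Char() = {","c, "."c, ";"c, ":"c, "("c, ")"c}

    ' Returns every ID that occurs in the text as a whole token, ignoring case.
    Function FindMatchingIds(text As String, ids As IEnumerable(Of String)) As List(Of String)
        Dim separators As Char() = {" "c, ControlChars.Tab, ControlChars.Cr, ControlChars.Lf}
        Dim tokens As String() = text.Split(separators, StringSplitOptions.RemoveEmptyEntries)

        ' Build the set once per document; each later lookup is a hash lookup.
        Dim tokensInText As New HashSet(Of String)(StringComparer.OrdinalIgnoreCase)
        For Each token As String In tokens
            tokensInText.Add(token.Trim(TrimChars))
        Next

        Dim matches As New List(Of String)
        For Each id As String In ids
            If tokensInText.Contains(id) Then
                matches.Add(id)
            End If
        Next
        Return matches
    End Function

    Sub Main()
        Dim text As String = "Some text, including AX55F8000AFXZ and pp-q4681tx/aa."
        Dim ids As New List(Of String) From {"AX40EH5300", "AX55F8000AFXZ", "PP-Q4681TX/AA"}

        For Each id As String In FindMatchingIds(text, ids)
            Console.WriteLine(id)
        Next
    End Sub

End Module
```

In the question's code, the text could be read straight into a String with `File.ReadAllText(path)` instead of going through the textbox, and the IDs from the DataView could be copied into a list once, before the file loop, then passed to this function for each document.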