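I have two generic lists of type String. The first contains about 1,000,000 terms and the second contains about 100,000 keywords. A term in the first list may or may not contain keywords from the second list. I need to isolate the terms in the first list that contain none of the keywords in the second list. Currently I do it like this (VB.NET with Framework 3.5):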
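' Assumes keyword is visible to ContainsKeyword (for example, a Shared field);
' the post does not show its declaration.
For Each keyword In keywordList
    termList.RemoveAll(AddressOf ContainsKeyword)
Next

' A RemoveAll predicate must return Boolean, not Integer.
Private Shared Function ContainsKeyword(ByVal X As String) As Boolean
    Return X.IndexOf(keyword) >= 0
End Function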
Needless to say, this takes forever. What is the fastest way to accomplish this? Maybe using a Dictionary? Any hints would help.
Answer 0 (score: 0)
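A plain dictionary of keywords won't work here, because you are doing a contains check rather than a straight equality check. One approach you could take is to combine the search terms into a tree. How much the tree helps depends on how much the search terms overlap, that is, how many leading characters they share. I put together a basic tree implementation (without much testing) as a starting point: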
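Imports System
Imports System.Collections.Generic
Imports System.Linq

Public Class WordSearchTree
    ' Child nodes keyed by the next character of a keyword.
    Private ReadOnly _branches As New Dictionary(Of Char, WordSearchTree)
    ' True when a keyword ends at this node; needed when one keyword is a prefix of another.
    Private _isKeywordEnd As Boolean

    ' Returns True if any keyword in the tree occurs somewhere in word (ordinal, case-sensitive).
    Public Function WordContainsTerm(ByVal word As String) As Boolean
        ' Range takes (start, count), so the count must be word.Length to try every start index.
        Return Not String.IsNullOrEmpty(word) AndAlso _
            Enumerable.Range(0, word.Length) _
                .Any(Function(i) WordContainsInternal(word, i))
    End Function

    ' Walks the tree along word from charIndex; succeeds once a complete keyword has been matched.
    Private Function WordContainsInternal(ByVal word As String, ByVal charIndex As Integer) As Boolean
        If _isKeywordEnd Then Return True
        If charIndex >= word.Length Then Return False
        Dim child As WordSearchTree = Nothing
        Return _branches.TryGetValue(word(charIndex), child) AndAlso _
            child.WordContainsInternal(word, charIndex + 1)
    End Function

    Public Shared Function BuildTree(ByVal words As IEnumerable(Of String)) As WordSearchTree
        If words Is Nothing Then Throw New ArgumentNullException("words")
        Dim ret As New WordSearchTree()
        For Each w In words
            ' Skip null or empty keywords; marking the root as a keyword end would match every term.
            If String.IsNullOrEmpty(w) Then Continue For
            Dim curTree As WordSearchTree = ret
            For Each c In w
                If Not curTree._branches.ContainsKey(c) Then
                    curTree._branches.Add(c, New WordSearchTree())
                End If
                curTree = curTree._branches(c)
            Next
            curTree._isKeywordEnd = True
        Next
        Return ret
    End Function
End Class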
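To make the earlier point about dictionaries concrete, here is a minimal sketch contrasting a hash lookup with a substring check. The module name, the helper names `IsExactKeyword` and `ContainsAnyKeyword`, and the sample data are all hypothetical and not part of the original post; `HashSet(Of String)` stands in for a dictionary of keywords.

```vbnet
Imports System
Imports System.Collections.Generic

Module ExactVersusContains
    ' Hash lookup: True only when the whole term equals one of the keywords.
    Function IsExactKeyword(ByVal keywords As HashSet(Of String), ByVal term As String) As Boolean
        Return keywords.Contains(term)
    End Function

    ' Substring check: True when any keyword occurs anywhere inside the term.
    Function ContainsAnyKeyword(ByVal keywords As IEnumerable(Of String), ByVal term As String) As Boolean
        For Each k As String In keywords
            If term.IndexOf(k, StringComparison.Ordinal) >= 0 Then Return True
        Next
        Return False
    End Function

    Sub Main()
        ' Hypothetical sample data.
        Dim keywords As New HashSet(Of String)(New String() {"cat", "dog"})
        Dim term As String = "concatenate"

        Console.WriteLine("Exact lookup:    " & IsExactKeyword(keywords, term))
        Console.WriteLine("Substring check: " & ContainsAnyKeyword(keywords, term))
    End Sub
End Module
```

The hash lookup misses "concatenate" even though it contains "cat", which is why the keywords need a structure that can be matched from any starting position in the term.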
And with that tree, you can do this:
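' The factory method on WordSearchTree is named BuildTree.
Dim keys As WordSearchTree = WordSearchTree.BuildTree(keywordList)
' Removes every term that contains at least one keyword, leaving the keyword-free terms.
termList.RemoveAll(AddressOf keys.WordContainsTerm)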
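Since the tree is offered without much testing, it may be worth checking it against a brute-force filter on a small sample before running it on the full lists. The sketch below is one such reference; the module name, the function name `TermsWithoutKeywords`, and the sample data are hypothetical. It assumes an ordinal, case-sensitive comparison and ignores empty keywords.

```vbnet
Imports System
Imports System.Collections.Generic

Module BruteForceReference
    ' Returns the terms that contain none of the keywords.
    ' Runs in O(terms * keywords) time, so it is only meant for small samples.
    Function TermsWithoutKeywords(ByVal terms As IEnumerable(Of String), _
                                  ByVal keywords As IEnumerable(Of String)) As List(Of String)
        Dim kept As New List(Of String)
        For Each t As String In terms
            Dim hit As Boolean = False
            For Each k As String In keywords
                ' Assumption: empty keywords are skipped rather than treated as matching everything.
                If k.Length > 0 AndAlso t.IndexOf(k, StringComparison.Ordinal) >= 0 Then
                    hit = True
                    Exit For
                End If
            Next
            If Not hit Then kept.Add(t)
        Next
        Return kept
    End Function

    Sub Main()
        ' Hypothetical sample data.
        Dim terms As String() = New String() {"concatenate", "hotdog", "bird", "at"}
        Dim keywords As String() = New String() {"cat", "dog"}
        For Each t As String In TermsWithoutKeywords(terms, keywords)
            Console.WriteLine(t)
        Next
    End Sub
End Module
```

Running both filters over the same few thousand terms and comparing the resulting lists should expose any case where the tree and the plain `IndexOf` check disagree.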