VB.NET - Filtering a large generic list against another list

Time: 2011-08-18 06:17:30

Tags: list generics filtering

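I have two generic lists of type String: the first contains about 1,000,000 terms and the second about 100,000 keywords. A term in the first list may or may not contain a keyword from the second list. I need to isolate the terms in the first list that contain none of the keywords in the second list. Currently I do it like this (VB.NET with Framework 3.5):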

For Each keyword In keywordList
    termList.RemoveAll(AddressOf ContainsKeyword)
Next

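' The predicate passed to RemoveAll must return Boolean, not Integer.
' keyword is assumed to be a Shared field that the loop above assigns.
Private Shared Function ContainsKeyword(ByVal X As String) As Boolean
    Return X.IndexOf(keyword) >= 0
End Function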

Needless to say, this takes forever. What is the fastest way to achieve this? Maybe using a dictionary? Any hints would help.

1 answer:

Answer 0: (score: 0)

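A straight dictionary of the keywords will not work here, because you are doing a contains check rather than a simple equality check. One approach you could take is to combine the search terms into a tree. How much the tree helps depends on how much overlap there is among the search terms. I put together a basic tree implementation (without much testing) as a starting point: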

Public Class WordSearchTree

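    Private ReadOnly _branches As New Dictionary(Of Char, WordSearchTree)

    ' True when a keyword ends at this node. Needed so that a keyword that is a
    ' prefix of another keyword (e.g. "ab" and "abc") is still recognized.
    Private _isKeywordEnd As Boolean

    Public Function WordContainsTerm(ByVal word As String) As Boolean
        ' Try every start position. Enumerable.Range takes (start, count),
        ' so the count must be word.Length to include the last character.
        Return Not String.IsNullOrEmpty(word) AndAlso _
               Enumerable.Range(0, word.Length) _
                         .Any(Function(i) WordContainsInternal(word, i))
    End Function

    Private Function WordContainsInternal(ByVal word As String, ByVal charIndex As Integer) As Boolean
        ' A complete keyword has been matched.
        If _isKeywordEnd Then Return True
        ' Ran out of characters before completing a keyword.
        If charIndex >= word.Length Then Return False
        Dim child As WordSearchTree = Nothing
        Return _branches.TryGetValue(word(charIndex), child) AndAlso _
               child.WordContainsInternal(word, charIndex + 1)
    End Function

    Public Shared Function BuildTree(ByVal words As IEnumerable(Of String)) As WordSearchTree
        If words Is Nothing Then Throw New ArgumentNullException("words")
        Dim ret As New WordSearchTree()
        For Each w In words
            ' Skip empty keywords; marking the root would make every word match.
            If String.IsNullOrEmpty(w) Then Continue For
            Dim curTree As WordSearchTree = ret
            For Each c In w
                If Not curTree._branches.ContainsKey(c) Then
                    curTree._branches.Add(c, New WordSearchTree())
                End If
                curTree = curTree._branches(c)
            Next
            ' The last node reached marks the end of this keyword.
            curTree._isKeywordEnd = True
        Next
        Return ret
    End Function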

End Class

And with that tree, you can do something like this:

' The factory method defined on the class is BuildTree.
Dim keys As WordSearchTree = WordSearchTree.BuildTree(keywordList)
termList.RemoveAll(AddressOf keys.WordContainsTerm)
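
To make the point about the dictionary concrete, here is a small sketch of the difference between a hash lookup and a contains check. The module name, the two helper functions and the sample data are made up for illustration and are not from the question. It assumes .NET 3.5 with a reference to System.Core.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Public Module DictionaryVersusContains

    ' Hash lookup: only finds terms that are exactly equal to a keyword.
    Public Function ExactHits(ByVal terms As IEnumerable(Of String), _
                              ByVal keywords As HashSet(Of String)) As List(Of String)
        Return terms.Where(Function(t) keywords.Contains(t)).ToList()
    End Function

    ' Substring check: finds terms that contain a keyword anywhere.
    ' This is the slow brute-force form the question is trying to avoid.
    Public Function SubstringHits(ByVal terms As IEnumerable(Of String), _
                                  ByVal keywords As IEnumerable(Of String)) As List(Of String)
        Return terms.Where(Function(t) keywords.Any( _
                   Function(k) t.IndexOf(k, StringComparison.Ordinal) >= 0)).ToList()
    End Function

    Public Sub Main()
        ' Hypothetical sample data.
        Dim keywords As New HashSet(Of String)(New String() {"cat", "dog"})
        Dim terms As New List(Of String)(New String() {"cat", "concatenate", "bird", "hotdog"})

        ' Prints: cat
        Console.WriteLine(String.Join(", ", ExactHits(terms, keywords).ToArray()))
        ' Prints: cat, concatenate, hotdog
        Console.WriteLine(String.Join(", ", SubstringHits(terms, keywords).ToArray()))
    End Sub

End Module
```

The hash lookup misses "concatenate" and "hotdog" because neither is equal to a keyword, which is why a plain dictionary cannot replace the contains check by itself.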