在VBA中创建倒排索引的有效方法

时间:2014-03-03 01:21:29

标签: arrays vba excel-vba dictionary inverted-index

我正在创建一个倒置索引来获取单词字典,其中包含单词出现的行号的相关列表(开始行号和出现在该行内给定单元格中的单词列表)。 / p>

我已经设法为此工作了一些代码,但我发现处理添加到数组(字典中的值)有点麻烦,我想知道是否有更高效或更优雅的方式来处理此

我愿意使用数组,集合或任何其他可以轻松搜索的数据类型来存储字典值中的行号列表。我已经粘贴了我的代码的缩减版本来演示下面的核心问题,问题实际上只是关于BuildInvertedIndex过程,但其余部分是为了尝试更容易重新创建场景:

Sub Test()
' minimum included here to demonstrate use of buildInvertedIndex procedure

    Dim vRange As Range
    Dim vDict As Dictionary

    Set vRange = ActiveSheet.Range("F2:F20585")
    Set vDict = New Dictionary

    BuildInvertedIndex vDict, vRange

    ' test values returned in dictionary (word: [line 1, ..., line n])
    Dim k As Variant, vCounter As Long
    vCounter = 0
    For Each k In vDict.Keys
        Debug.Print k & ": " & ArrayToString(vDict.Item(k))
        vCounter = vCounter + 1
        If vCounter >= 10 Then
            Exit For
        End If
    Next


End Sub


Sub BuildInvertedIndex(pDict As Dictionary, pRange As Range)

    Dim cell As Range
    Dim words As Variant, word As Variant, val As Variant
    Dim tmpArr() As Long
    Dim newLen As Long, i As Long

    ' loop through cells (one col wide so same as looping through lines)
    For Each cell In pRange.Cells

        ' loop through words in line
        words = Split(cell.Value)
        For Each word In words

            If Not pDict.exists(word) Then
                ' start line array with first row number
                pDict.Add word, Array(cell.Row())
            Else
                i = 0
                If Not InArray(cell.Row(), pDict.Item(word)) Then
                    newLen = UBound(pDict.Item(word)) + 1
                    ReDim tmpArr(newLen)
                    For Each val In tmpArr
                        If i < newLen Then
                            tmpArr(i) = pDict.Item(word)(i)
                        Else
                            tmpArr(i) = cell.Row()
                        End If
                        i = i + 1
                    Next val
                    pDict.Item(word) = tmpArr
                End If
            End If
        Next word
    Next cell

End Sub


Function ArrayToString(vArray As Variant, _
                       Optional vDelim As String = ",") As String
' only included to support test (be able to see what is in the arrays)

    Dim vDelimString As String
    Dim i As Long

    For i = LBound(vArray) To UBound(vArray)
        vDelimString = vDelimString & CStr(vArray(i)) & _
                       IIf(vCounter < UBound(vArray), vDelim, "")
    Next

    ArrayToString = vDelimString
End Function

要运行此项,您将需要活动工作表(句子)的F列中的值,如果您还没有它,则还需要在VBA环境中为字典数据类型添加对Microsoft Scripting Runtime的引用可用(工具 - &gt;参考 - &gt; Microsoft Scripting Runtime)。

正如您将从代码中看到的那样,我需要在现有数组中插入新的行号(在字典中存储为值)时会有点混乱。由于我不知道只是扩展此数组的方法(不清除现有值),我使用变量tmpArr创建适当大小的数组,然后从字典中的现有数组中逐个复制值然后将当前行号添加到结尾。然后使用临时数组替换该键的现有值(当前单词)。

对此有任何建议将不胜感激。

1 个答案:

答案 0 :(得分:1)

  

我愿意使用数组,集合或任何其他数据类型

正如我所见,使用collection而不是数组会更简单:

Sub BuildInvertedIndex(pDict As Dictionary, pRange As Range)
    Dim cell As Range
    Dim words, word
    Dim i As Long    
    ' loop through cells (one col wide so same as looping through lines)
    For Each cell In pRange.Cells    
        ' loop through words in line
        words = Split(cell.Value)
        For Each word In words    
            If Not pDict.Exists(word) Then
                ' initialize collection
                pDict.Add word, New Collection
            End If
            'try to add to collection. If row is already in collecton, nothing happend. Storing key makes you sure there're only unique rows
            On Error Resume Next
            pDict.Item(word).Add Item:=cell.Row, Key:=CStr(cell.Row)
            On Error GoTo 0                
        Next word
    Next cell
End Sub

下一步,是将ArrayToString略微修改为ColToString

Function ColToString(vCol As Collection, _
                   Optional vDelim As String = ",") As String
' only included to support test (be able to see what is in the arrays)

    Dim vDelimString As String
    Dim i As Long

    For i = 1 To vCol.Count
        vDelimString = vDelimString & CStr(vCol.Item(i)) & _
                       IIf(i < vCol.Count, vDelim, "")
    Next

    ColToString = vDelimString
End Function

和测试子程序(仅更改了一行 - Debug.Print k & ": " & ColToString(vDict.Item(k)),目标范围更改为"F2:F5"):

Sub Test()
' minimum included here to demonstrate use of buildInvertedIndex procedure

    Dim vRange As Range
    Dim vDict As Dictionary

    Set vRange = ActiveSheet.Range("F2:F5")
    Set vDict = New Dictionary

    BuildInvertedIndex vDict, vRange

    ' test values returned in dictionary (word: [line 1, ..., line n])
    Dim k As Variant, vCounter As Long
    vCounter = 0
    For Each k In vDict.Keys
        Debug.Print k & ": " & ColToString(vDict.Item(k))
        vCounter = vCounter + 1
        If vCounter >= 10 Then
            Exit For
        End If
    Next

    'clean up memory
    Set vDict = Nothing
End Sub

<强>结果:

enter image description here


<强>更新

为了提高代码的速度,您可以在数组中存储范围(下一个方法仅适用于单列范围,但您可以轻松修改它):

测试子:

Sub TestWirhArray()
' minimum included here to demonstrate use of buildInvertedIndex procedure

    Dim vRange As Range
    Dim vDict As Dictionary
    Dim myArr As Variant

    Set vDict = New Dictionary
    Set vRange = ActiveSheet.Range("F2:F20585")
    myArr = vRange.Value
    BuildInvertedIndexWithArr vDict, myArr, vRange.Row

    ' test values returned in dictionary (word: [line 1, ..., line n])
    Dim k As Variant, vCounter As Long
    vCounter = 0
    For Each k In vDict.Keys
        Debug.Print k & ": " & ColToString(vDict.Item(k))
        vCounter = vCounter + 1
        If vCounter >= 10 Then
            Exit For
        End If
    Next

    'clean up memory
    Set vDict = Nothing
End Sub

新版BuildInvertedIndexWithArr

Sub BuildInvertedIndexWithArr(pDict As Dictionary, pArr, firstRow As Long)
    Dim cell, words, word
    Dim i As Long, j As Long

    j = firstRow
    ' loop through cells (one col wide so same as looping through lines)
    For Each cell In pArr

        ' loop through words in line
        words = Split(cell)
        For Each word In words

            If Not pDict.exists(word) Then
                ' initialize collection
                pDict.Add word, New Collection
            End If

            On Error Resume Next
            pDict.Item(word).Add Item:=j, Key:=CStr(j)
            On Error GoTo 0

        Next word
        j = j + 1
    Next cell
End Sub