VBA - 匹配2个排序字符串数组,其中一些元素不匹配 - 优化

时间:2017-04-21 21:18:07

标签: arrays excel vba algorithm sorting

我有2个数据集,包含很多字符串,我需要匹配。第一个是1200行,第二个是大约800 000.我通过VBA调用Excel排序对这两个集合进行排序,因此它们按升序排列,因此我可以显着优化搜索速度,通过在最后一行开始第二个数据集的每个下一次迭代匹配。

不幸的是,如果找不到匹配项,则永远不会遇到Exit For,即使根据我搜索的字词检查的字符串还有字母表(&gt; my string)。我尝试实现比较If vData1(arrayIndex1, 1) < vData2(arrayIndex2, 1) Then(可能之前使用'Mod'检查,如果在每次迭代中执行它会很慢),但我遇到不正确的比较值,例如?"µm">"zzzzz"返回true,而在数据中在字符串以“a”开头之前设置它是应该的。

有没有可靠的方法来解决这个问题?

Dim optimizedCounter as long, arrayIndex1 as long, arrayIndex2 as long
Dim vData1 as variant, vData2 as variant

'sort 2 data sets in Excel ascending
'assign data sets to arrays vData1 and vData2

optimizedCounter = LBound(vData2)

For arrayIndex1 = LBound(vData1) To UBound(vData1)
    For arrayIndex2 = optimizedCounter To UBound(vData2)
        If vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1) Then
            'do action when strings match
            optimizedCounter = arrayIndex2 'narrow down 2nd data set's list, arrayIndex2 + 1 if vData1 has no duplicates
            Exit For 'match has been found, exit loop and continue matching for next element in 1st data set
        End If
    Next arrayIndex2
Next arrayIndex1

修改

感谢大家提出的精彩建议。目前,A.S.H与Application.Evaluate / StrComp的解决方案为我做了诀窍。因为我使用默认二进制比较vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1),我想保留当前速度,我不能使用选项比较文本。

For arrayIndex1 = LBound(vData1) To UBound(vData1)
    For arrayIndex2 = optimizedCounter To UBound(vData2)
        If vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1) Then
            'do action when strings match
            optimizedCounter = arrayIndex2 'narrow down 2nd data set's list, arrayIndex2 + 1 if vData1 has no duplicates
            Exit For 'match has been found, exit loop and continue matching for next element in 1st data set
        ElseIf arrayIndex2 Mod 1000 = 0 Then
            If Application.Evaluate("""" & vData2(arrayIndex2, 1) & _
            """ > """ & vData1(arrayIndex1, 1) & """") Then Exit For
            'line below can be used instead of Application.Evaluate, the same speed, easier structure
            'If StrComp(vData2(arrayIndex2, 1), vData1(arrayIndex1, 1), vbTextCompare) = 1 Then Exit For
        End If           
    Next arrayIndex2
Next arrayIndex1 

由于此方法需要一些时间,因此为了获得性能增益,我不得不每隔n次迭代使用它。根据数据集长度和匹配值的百分比,最佳mod值将不同。

作为对已检查组合数量的评论,我的搜索字词列表包含重复项。

香草代码:

执行时间:12.76

处理的组合:144596591

Application.Evaluate或StrComp:

执行时间:17.30

处理的组合:1192341

Application.Evaluate或StrComp条件mod 50 = 0:

执行时间:0.48

处理的组合:1201717

条件mod 1000 = 0时的Application.Evaluate或StrComp:

执行时间:0.16

处理的组合:1376317

由于处理的组合数量较多,增加mod值会增加此时的执行时间。

我尝试将With Application置于主循环之外并使用.Evaluate,它对速度完全没有影响。

编辑2:

Application.MatchApplication.Vlookup不适用于带&gt;的数组65536行。然而,正如评论所指出的,它们确实适用于范围。

Dim vMatch as Variant, myRng as Range

'myRng is set to one-column range of values to look for, about 800K rows

For arrayIndex1 = LBound(vData1) To UBound(vData1)
        vMatch = Application.Match(vData1(arrayIndex1, 1), myRng, 0)
        If Not IsError(vMatch) Then
            'do action when strings match
        End If
Next arrayIndex1

MatchType = 0的Application.Match:

执行时间:28.81

查找次数:1200

2 个答案:

答案 0 :(得分:2)

  

If vData1(arrayIndex1, 1) < vData2(arrayIndex2, 1) Then ...我遇到不正确的比较值,例如?"µm">"zzzzz"返回true,而在数据集中它应该是,在以“a”开头的字符串之前。

实际上,如果字符串比较操作在先前的排序和代码中不同,则先前的排序变得无用。而这是因为

默认情况下,VBA中的比较为二进制

?"µm">"zzzzz"                              --->   True
?Application.Evaluate("""µm"">""zzzzz""")  --->   False
?StrComp("µm", "zzzzz")                    --->    1 
?StrComp("µm", "zzzzz", vbTextCompare)     --->   -1 
                        ^^^^^^^^^^^^^^

P.S。除非您在评论中指出Option Compare TextstrComp,或使用Excel的比较:

If Application.Evaluate("""" & vData1(arrayIndex1, 1) & _ 
    """ < """ & vData2(arrayIndex2, 1) & """") Then

这将解决比较不匹配的问题。实际上,基于<比较停止循环应该会更快。这是否是最好的算法是另一个争论。您的数组正在排序,二进制搜索应该是一个完美的候选者。

除非你进行二进制搜索,否则考虑使用Excel的内置函数,即Application.VLookupApplication.Match,它们几乎比VBA循环快一个数量级,即使后者正在开发预取数组。

答案 1 :(得分:1)

我使用一些二进制匹配函数进行了一些测试,它在大约3秒内运行2个数据集129K行对780K行,进行335K比较检查。这是二元搜索的力量+稍微调整一下。

一些经过修改的二进制搜索&#39;效用函数:

Public Function wsArrayBinaryMatch( _
        ByVal val As Variant, _
        arr() As Variant, _
        ByVal searchCol As Long, _
        Optional optimalStart As Long, Optional optimalEnd As Long, Optional exactMatch As Boolean = True) As Variant

  Dim a As Long, z As Long, curr As Long

  wsArrayBinaryMatch = "Not Found in Range"
  a = IIf(optimalStart, optimalStart, LBound(arr))
  z = IIf(optimalEnd, optimalEnd, UBound(arr))

  If compare(arr(a, searchCol), val) = 1 Then
        Exit Function
  End If

  If compare(arr(a, searchCol), val) = 0 Then
        wsArrayBinaryMatch = a
        Exit Function
  End If

  If compare(arr(z, searchCol), val) = -1 Then
        Exit Function
  End If

  While z - a > 1
        curr = Round((CLng(a) + CLng(z)) / 2, 0)
        If compare(arr(curr, searchCol), val) = 0 Then
              z = curr
              wsArrayBinaryMatch = curr
        End If

        If compare(arr(curr, searchCol), val) = -1 Then
              a = curr
        Else
              z = curr
        End If
  Wend

  If compare(arr(z, searchCol), val) = 0 Then
        wsArrayBinaryMatch = z
  Else
        If Not exactMatch Then
              wsArrayBinaryMatch = a
        Else
              'approx match to val was found inside the range...
              wsArrayBinaryMatch = "ApproxIndex" & a
        End If
  End If

End Function

Public Function wsArrayBinaryLookup( _
        ByVal val As Variant, _
        arr() As Variant, _
        ByVal searchCol As Long, _
        ByVal returnCol As Long, _
        Optional exactMatch As Boolean = True) As Variant

  Dim a As Long, z As Long, curr As Long

  wsArrayBinaryLookup = CVErr(xlErrNA)
  a = LBound(arr)
  z = UBound(arr)

  If compare(arr(a, searchCol), val) = 1 Then
        Exit Function
  End If

  If compare(arr(a, searchCol), val) = 0 Then
        wsArrayBinaryLookup = arr(a, returnCol)
        Exit Function
  End If

  If compare(arr(z, searchCol), val) = -1 Then
        Exit Function
  End If

  While z - a > 1
        curr = Round((CLng(a) + CLng(z)) / 2, 0)
        If compare(arr(curr, searchCol), val) = 0 Then
              z = curr
              wsArrayBinaryLookup = arr(curr, returnCol)
        End If

        If compare(arr(curr, searchCol), val) = -1 Then
              a = curr
        Else
              z = curr
        End If
  Wend

  If compare(arr(z, searchCol), val) = 0 Then
        wsArrayBinaryLookup = arr(z, returnCol)
  Else
        If Not exactMatch Then
              wsArrayBinaryLookup = arr(a, returnCol)
        End If
  End If

End Function

Public Function compare(ByVal x As Variant, ByVal y As Variant) As Long

  If IsNumeric(x) And IsNumeric(y) Then
        Select Case x - y
              Case Is = 0
                    compare = 0
              Case Is > 0
                    compare = 1
              Case Is < 0
                    compare = -1
        End Select
  Else
        If TypeName(x) = "String" And TypeName(y) = "String" Then
              compare = StrComp(x, y, vbTextCompare)
        End If
  End If

End Function
然后我编写了一个sub(可以转换为函数),试图充分利用排序数据,并提高限制搜索范围的效率。这涉及在第一个数据集中尝试查找低和高项目之间的交替。

请注意,两个数据集每个只有2列,并且它正在从每个数据集的第1列搜索匹配项。如果找到匹配,则它设置第一组中第二列的值。

在一个字符串中返回约匹配的方法有点hacky,但我累了......

Sub BinaryMatchInSortedSets()

  Dim set1() As Variant, set2() As Variant
  set1 = Sheet1.Range("E2:F129601").Value  '129K rows of strings and column F says 'Default'
  set2 = Sheet1.Range("I2:J780001").Value  '780K rows of strings and numbers

  Dim low1 As Long, high1 As Long
  Dim low2 As Long, high2 As Long
  Dim indexToFind As Long, approxIndex As Long
  low1 = LBound(set1)
  high1 = UBound(set1)
  low2 = LBound(set2)
  high2 = UBound(set2)

  Dim errString As String
  Dim matchIndex As Variant
  Dim searchingForLow As Boolean: searchingForLow = True
  While low1 <= high1 And low2 < high2
        indexToFind = IIf(searchingForLow, low1, high1)

        matchIndex = wsArrayBinaryMatch(set1(indexToFind, 1), set2, 1, low2, high2, True)

        If IsNumeric(matchIndex) Then
              'match found
              low2 = IIf(searchingForLow, matchIndex, low2)
              high2 = IIf(searchingForLow, high2, matchIndex)
              'do all other stuff in here that needs doing when match is found...
              set1(indexToFind, 2) = set2(matchIndex, 2)  'Just an example of what you could do
        Else
              'no match, so set up efficient search range if possible
              If Left(matchIndex, 11) = "ApproxIndex" Then
                    approxIndex = Mid(matchIndex, 12)
                    If searchingForLow Then
                          low2 = approxIndex + 1
                    Else
                          high2 = approxIndex - 1
                    End If
              End If
        End If

        If searchingForLow Then
              low1 = low1 + 1
        Else
              high1 = high1 - 1
        End If
        searchingForLow = Not searchingForLow
  Wend
  Sheet1.Range("L2").Resize(UBound(set1) - LBound(set1) + 1, 2).Value = set1

End Sub