我有2个数据集,包含很多字符串,我需要匹配。第一个是1200行,第二个是大约800 000.我通过VBA调用Excel排序对这两个集合进行排序,因此它们按升序排列,因此我可以显着优化搜索速度,通过在最后一行开始第二个数据集的每个下一次迭代匹配。
不幸的是,如果找不到匹配项,则永远不会遇到Exit For
,即使根据我搜索的字词检查的字符串还有字母表(&gt; my string)。我尝试实现比较If vData1(arrayIndex1, 1) < vData2(arrayIndex2, 1) Then
(可能之前使用'Mod'检查,如果在每次迭代中执行它会很慢),但我遇到不正确的比较值,例如?"µm">"zzzzz"
返回true,而在数据中在字符串以“a”开头之前设置它是应该的。
有没有可靠的方法来解决这个问题?
Dim optimizedCounter as long, arrayIndex1 as long, arrayIndex2 as long
Dim vData1 as variant, vData2 as variant
'sort 2 data sets in Excel ascending
'assign data sets to arrays vData1 and vData2
optimizedCounter = LBound(vData2)
For arrayIndex1 = LBound(vData1) To UBound(vData1)
For arrayIndex2 = optimizedCounter To UBound(vData2)
If vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1) Then
'do action when strings match
optimizedCounter = arrayIndex2 'narrow down 2nd data set's list, arrayIndex2 + 1 if vData1 has no duplicates
Exit For 'match has been found, exit loop and continue matching for next element in 1st data set
End If
Next arrayIndex2
Next arrayIndex1
修改
感谢大家提出的精彩建议。目前,A.S.H与Application.Evaluate
/ StrComp
的解决方案为我做了诀窍。因为我使用默认二进制比较vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1)
,我想保留当前速度,我不能使用选项比较文本。
For arrayIndex1 = LBound(vData1) To UBound(vData1)
For arrayIndex2 = optimizedCounter To UBound(vData2)
If vData1(arrayIndex1, 1) = vData2(arrayIndex2, 1) Then
'do action when strings match
optimizedCounter = arrayIndex2 'narrow down 2nd data set's list, arrayIndex2 + 1 if vData1 has no duplicates
Exit For 'match has been found, exit loop and continue matching for next element in 1st data set
ElseIf arrayIndex2 Mod 1000 = 0 Then
If Application.Evaluate("""" & vData2(arrayIndex2, 1) & _
""" > """ & vData1(arrayIndex1, 1) & """") Then Exit For
'line below can be used instead of Application.Evaluate, the same speed, easier structure
'If StrComp(vData2(arrayIndex2, 1), vData1(arrayIndex1, 1), vbTextCompare) = 1 Then Exit For
End If
Next arrayIndex2
Next arrayIndex1
由于此方法需要一些时间,因此为了获得性能增益,我不得不每隔n次迭代使用它。根据数据集长度和匹配值的百分比,最佳mod值将不同。
作为对已检查组合数量的评论,我的搜索字词列表包含重复项。
香草代码:
执行时间:12.76
处理的组合:144596591
Application.Evaluate或StrComp:
执行时间:17.30
处理的组合:1192341
Application.Evaluate或StrComp条件mod 50 = 0:
执行时间:0.48
处理的组合:1201717
条件mod 1000 = 0时的Application.Evaluate或StrComp:
执行时间:0.16
处理的组合:1376317
由于处理的组合数量较多,增加mod值会增加此时的执行时间。
我尝试将With Application
置于主循环之外并使用.Evaluate,它对速度完全没有影响。
编辑2:
Application.Match
和Application.Vlookup
不适用于带&gt;的数组65536行。然而,正如评论所指出的,它们确实适用于范围。
Dim vMatch as Variant, myRng as Range
'myRng is set to one-column range of values to look for, about 800K rows
For arrayIndex1 = LBound(vData1) To UBound(vData1)
vMatch = Application.Match(vData1(arrayIndex1, 1), myRng, 0)
If Not IsError(vMatch) Then
'do action when strings match
End If
Next arrayIndex1
MatchType = 0的Application.Match:
执行时间:28.81
查找次数:1200
答案 0 :(得分:2)
If vData1(arrayIndex1, 1) < vData2(arrayIndex2, 1) Then
...我遇到不正确的比较值,例如?"µm">"zzzzz"
返回true,而在数据集中它应该是,在以“a”开头的字符串之前。
实际上,如果字符串比较操作在先前的排序和代码中不同,则先前的排序变得无用。而这是因为
默认情况下,VBA中的比较为二进制
?"µm">"zzzzz" ---> True
?Application.Evaluate("""µm"">""zzzzz""") ---> False
?StrComp("µm", "zzzzz") ---> 1
?StrComp("µm", "zzzzz", vbTextCompare) ---> -1
^^^^^^^^^^^^^^
P.S。除非您在评论中指出Option Compare Text
或strComp
,或使用Excel的比较:
If Application.Evaluate("""" & vData1(arrayIndex1, 1) & _
""" < """ & vData2(arrayIndex2, 1) & """") Then
这将解决比较不匹配的问题。实际上,基于<
比较停止循环应该会更快。这是否是最好的算法是另一个争论。您的数组正在排序,二进制搜索应该是一个完美的候选者。
除非你进行二进制搜索,否则考虑使用Excel的内置函数,即Application.VLookup
或Application.Match
,它们几乎比VBA循环快一个数量级,即使后者正在开发预取数组。
答案 1 :(得分:1)
我使用一些二进制匹配函数进行了一些测试,它在大约3秒内运行2个数据集129K行对780K行,进行335K比较检查。这是二元搜索的力量+稍微调整一下。
一些经过修改的二进制搜索&#39;效用函数:
Public Function wsArrayBinaryMatch( _
ByVal val As Variant, _
arr() As Variant, _
ByVal searchCol As Long, _
Optional optimalStart As Long, Optional optimalEnd As Long, Optional exactMatch As Boolean = True) As Variant
Dim a As Long, z As Long, curr As Long
wsArrayBinaryMatch = "Not Found in Range"
a = IIf(optimalStart, optimalStart, LBound(arr))
z = IIf(optimalEnd, optimalEnd, UBound(arr))
If compare(arr(a, searchCol), val) = 1 Then
Exit Function
End If
If compare(arr(a, searchCol), val) = 0 Then
wsArrayBinaryMatch = a
Exit Function
End If
If compare(arr(z, searchCol), val) = -1 Then
Exit Function
End If
While z - a > 1
curr = Round((CLng(a) + CLng(z)) / 2, 0)
If compare(arr(curr, searchCol), val) = 0 Then
z = curr
wsArrayBinaryMatch = curr
End If
If compare(arr(curr, searchCol), val) = -1 Then
a = curr
Else
z = curr
End If
Wend
If compare(arr(z, searchCol), val) = 0 Then
wsArrayBinaryMatch = z
Else
If Not exactMatch Then
wsArrayBinaryMatch = a
Else
'approx match to val was found inside the range...
wsArrayBinaryMatch = "ApproxIndex" & a
End If
End If
End Function
Public Function wsArrayBinaryLookup( _
ByVal val As Variant, _
arr() As Variant, _
ByVal searchCol As Long, _
ByVal returnCol As Long, _
Optional exactMatch As Boolean = True) As Variant
Dim a As Long, z As Long, curr As Long
wsArrayBinaryLookup = CVErr(xlErrNA)
a = LBound(arr)
z = UBound(arr)
If compare(arr(a, searchCol), val) = 1 Then
Exit Function
End If
If compare(arr(a, searchCol), val) = 0 Then
wsArrayBinaryLookup = arr(a, returnCol)
Exit Function
End If
If compare(arr(z, searchCol), val) = -1 Then
Exit Function
End If
While z - a > 1
curr = Round((CLng(a) + CLng(z)) / 2, 0)
If compare(arr(curr, searchCol), val) = 0 Then
z = curr
wsArrayBinaryLookup = arr(curr, returnCol)
End If
If compare(arr(curr, searchCol), val) = -1 Then
a = curr
Else
z = curr
End If
Wend
If compare(arr(z, searchCol), val) = 0 Then
wsArrayBinaryLookup = arr(z, returnCol)
Else
If Not exactMatch Then
wsArrayBinaryLookup = arr(a, returnCol)
End If
End If
End Function
Public Function compare(ByVal x As Variant, ByVal y As Variant) As Long
If IsNumeric(x) And IsNumeric(y) Then
Select Case x - y
Case Is = 0
compare = 0
Case Is > 0
compare = 1
Case Is < 0
compare = -1
End Select
Else
If TypeName(x) = "String" And TypeName(y) = "String" Then
compare = StrComp(x, y, vbTextCompare)
End If
End If
End Function
然后我编写了一个sub(可以转换为函数),试图充分利用排序数据,并提高限制搜索范围的效率。这涉及在第一个数据集中尝试查找低和高项目之间的交替。
请注意,两个数据集每个只有2列,并且它正在从每个数据集的第1列搜索匹配项。如果找到匹配,则它设置第一组中第二列的值。
在一个字符串中返回约匹配的方法有点hacky,但我累了......
Sub BinaryMatchInSortedSets()
Dim set1() As Variant, set2() As Variant
set1 = Sheet1.Range("E2:F129601").Value '129K rows of strings and column F says 'Default'
set2 = Sheet1.Range("I2:J780001").Value '780K rows of strings and numbers
Dim low1 As Long, high1 As Long
Dim low2 As Long, high2 As Long
Dim indexToFind As Long, approxIndex As Long
low1 = LBound(set1)
high1 = UBound(set1)
low2 = LBound(set2)
high2 = UBound(set2)
Dim errString As String
Dim matchIndex As Variant
Dim searchingForLow As Boolean: searchingForLow = True
While low1 <= high1 And low2 < high2
indexToFind = IIf(searchingForLow, low1, high1)
matchIndex = wsArrayBinaryMatch(set1(indexToFind, 1), set2, 1, low2, high2, True)
If IsNumeric(matchIndex) Then
'match found
low2 = IIf(searchingForLow, matchIndex, low2)
high2 = IIf(searchingForLow, high2, matchIndex)
'do all other stuff in here that needs doing when match is found...
set1(indexToFind, 2) = set2(matchIndex, 2) 'Just an example of what you could do
Else
'no match, so set up efficient search range if possible
If Left(matchIndex, 11) = "ApproxIndex" Then
approxIndex = Mid(matchIndex, 12)
If searchingForLow Then
low2 = approxIndex + 1
Else
high2 = approxIndex - 1
End If
End If
End If
If searchingForLow Then
low1 = low1 + 1
Else
high1 = high1 - 1
End If
searchingForLow = Not searchingForLow
Wend
Sheet1.Range("L2").Resize(UBound(set1) - LBound(set1) + 1, 2).Value = set1
End Sub