我需要比较两个数据集并从中提取匹配项。我有一个来自每个数据集中5列的复合键,结束了我需要提取的第6列。列由文本,日期和整数组成。两组都略低于500k行。
目前我在表a中使用for循环并循环遍历表b。使用带有和参数的if语句比较行以获取复合键。
Sub ArraySearch()
Dim Main As Long
Dim Search As Long
Dim arrData() As Variant
Dim arrSource As Variant
arrData = Sheets("Sheet1").Range("H3:M500000").Value
arrSource = Sheets("Ark1").Range("A3:H500000").Value
Main = 1
Search = 1
For Main = 1 To UBound(arrSource, 1)
For Search = 1 To UBound(arrData, 1)
If arrSource(Main, 3) = arrData(Search, 1) And _
arrSource(Main, 4) = arrData(Search, 2) And _
arrSource(Main, 1) = arrData(Search, 3) And _
arrSource(Main, 2) = arrData(Search, 4) And _
arrSource(Main, 5) = arrData(Search, 5) _
Then
arrSource(Main, 8) = arrData(Search, 6)
Exit For
End If
Next
Next
Sheets("Sheet2").Range("A3:H500000") = arrSource
End Sub
到目前为止,最快的方法是将两个表加载到一个数组中并执行内存循环。
这是永远的。我们说的是几小时而不是几分钟。
有没有什么方法可以提高速度? 或者我需要使用其他一些程序吗? (将其加载到数据库并使用SQL,使用Visual Studio与普通的VB.net,SSIS)
我希望这可以在VBA中完成,所以任何指针都会非常感激。
修改
散列5列键是否会提高速度,还是必须迭代的行的共享量会产生滞后?
答案 0 :(得分:5)
比较两个列表的最快方法是根据公共密钥向Dictionary添加值。字典经过优化,可以搜索键,并且可以更快地返回基于键的值,然后可以迭代数组。
Sub DictionarySearch()
Dim dict
Dim key As String
Dim x As Long
Dim arrData() As Variant
Dim arrSource As Variant
Set dict = CreateObject("Scripting.Dictionary")
arrData = Worksheets("Sheet1").Range("H3:M500000").Value
arrSource = Worksheets("Ark1").Range("A3:H500000").Value
For x = 1 To UBound(arrData, 1)
key = arrData(x, 1) & ":" & arrData(x, 2) & ":" & arrData(x, 3) & ":" & arrData(x, 4) & ":" & arrData(x, 5)
If Not dict.Exists(key) Then dict.Add key, arrData(x, 6)
Next
For x = 1 To UBound(arrSource, 1)
key = arrSource(x, 3) & ":" & arrSource(x, 4) & ":" & arrSource(x, 1) & ":" & arrSource(x, 2) & ":" & arrSource(x, 5)
If dict.Exists(key) Then arrSource(x, 8) = dict(key)
Next
Sheets("Sheet2").Range("A3:H500000") = arrSource
End Sub
答案 1 :(得分:2)
不是一个完整的答案,但值得一试的想法。在this answer of mine to my own question中,我使用了一些加速技巧,比如使用.Value2
而不是默认属性(.Value
)并指定 vbNullString 而不是零长度字符串(" ")找到匹配的数组元素,使Excel处理更少。也许您可以Heap's algorithm使用this answer,但我不太确定。
答案 2 :(得分:1)
欢迎来到性能改进的奇妙世界: - )
让我解释一下你在做什么: 您正在获取两个数据集,每个数据集包含500,000个条目。然后你循环遍历它们,如下所示:
for every member in dataset1 do
for every member in dataset2 do
if condition1 is met, and
if condition2 is met, and
if condition3 is met, and
if condition4 is met, and
if condition5 is met
then do something
end if-loop
end for-loop (dataset2)
end for-loop (dataset1)
当您计算您正在执行的操作数量时,我们会看到以下内容:
500,000 runs through dataset1
500,000 runs through dataset2
5 (number of conditions to check)
=> 1250,000,000,000 actions, this is enormous!
最重要的是,你正在使用VBA:VBA是一种脚本语言,这意味着你到达这一行代码时,每一行代码都被翻译成机器语言(如果你要使用另一种语言,那么你可以编译,翻译成机器语言只会进行一次,这种机器语言将在之后执行)
如果您想继续使用VBA,我可以提供两条建议:
这将导致这种新算法:
for every member in dataset1 do
go in dataset2 from the start to the maximum, defined by the first for-loop, and do
if condition1 is met, then:
if condition2 is met, then:
if condition3 is met, then:
if condition4 is met, then:
if condition5 is met
then do something
end if-loop
end if-loop
end if-loop
end if-loop
end if-loop
end for-loop (dataset2)
end for-loop (dataset1)
这种工作方式可以减少计算机执行的操作量:
500,000 runs through dataset1
log(500,000) runs through dataset2 (it's only browsed until a certain limit)
3 conditions (on average)
=> 500,000 * log(500,000) * 3 = 8,500,000 actions (on average), which is manageable
我希望这对你有意义。在我看来,这里的主要问题是对数据集进行排序,这将大大提高您的性能!
答案 3 :(得分:0)
Excel需要评估的大量数据。 问题是,Excel是最好的解决方案吗,或者如果你用C ++或C#编写用于比较的应用程序会更好吗?因为它们会快得多。
但是如果您需要在VBA中执行此操作,则此代码可以帮助您。如果它们具有相同的数据,我总是使用它来比较2个范围。而且我从未对该代码有过速度问题,所以也许你可以看看它。
Sub Start()
Dim rng1 As Range
Dim rng2 As Range
Dim bolNotEqual As Boolean
Set rng1 = Sheets("Sheet1").Range("H3:M500000").Value
Set rng2 = arrSource = Sheets("Ark1").Range("A3:H500000").Value
'Compare the Sheets if both are Equal
Call CompareWorksheetRanges(rng1, rng2, bolNotEqual)
End Sub
Sub CompareWorksheetRanges(rng1 As Range, rng2 As Range, ByRef bol As Boolean)
Dim r As Long, c As Integer
Dim lr1 As Long, lr2 As Long, lc1 As Integer, lc2 As Integer
Dim maxR As Long, maxC As Integer, cf1 As String, cf2 As String
Dim rptWB As Workbook, DiffCount As Long
'If one rng is Empty Exit sup
If rng1 Is Nothing Or rng2 Is Nothing Then Exit Sub
If rng1.Areas.Count > 1 Or rng2.Areas.Count > 1 Then
MsgBox "Can't compare multiple selections!", _
vbExclamation, "Compare Worksheet Ranges"
Exit Sub
End If
Application.StatusBar = "Creating the report..."
'Testing if the Ranges have the Same sice
Set rptWB = Workbooks.Add
With rng1
lr1 = .Rows.Count
lc1 = .Columns.Count
End With
With rng2
lr2 = .Rows.Count
lc2 = .Columns.Count
End With
maxR = lr1
maxC = lc1
If maxR < lr2 Then maxR = lr2
If maxC < lc2 Then maxC = lc2
If lr1 <> lr2 Or lc1 <> lc2 Then
If MsgBox("The two ranges you want to compare are of different size!" & _
Chr(13) & "Do you want to continue anyway?", _
vbQuestion + vbYesNo, "Compare Worksheet Ranges") = vbNo Then Exit Sub
End If
'End Testing sice
DiffCount = 0
'Compare the Ranges if same Value
For c = 1 To maxC
Application.StatusBar = "Comparing cells " & _
Format(c / maxC, "0 %") & "..."
For r = 1 To maxR
cf1 = ""
cf2 = ""
On Error Resume Next
cf1 = rng1.Cells(r, c).FormulaLocal
cf2 = rng2.Cells(r, c).FormulaLocal
On Error GoTo 0
If cf1 <> cf2 Then
DiffCount = DiffCount + 1
Cells(r, c).Formula = "'" & cf1 & " <> " & cf2
End If
Next r
Next c
Application.StatusBar = "Formatting the report..."
rptWB.Close False
Set rptWB = Nothing
If DiffCount = 0 Then
bol = False
Else
bol = True
End If
Application.StatusBar = False
Application.ScreenUpdating = True
End Sub