加快在Excel中循环遍历大型数据集

时间:2016-07-26 08:36:26

标签: excel-vba loops vba excel

我需要比较两个数据集并从中提取匹配项。我有一个来自每个数据集中5列的复合键,结束了我需要提取的第6列。列由文本,日期和整数组成。两组都略低于500k行。

目前我在表a中使用for循环并循环遍历表b。使用带有和参数的if语句比较行以获取复合键。

Sub ArraySearch()

    Dim Main As Long
    Dim Search As Long
    Dim arrData() As Variant
    Dim arrSource As Variant

    arrData = Sheets("Sheet1").Range("H3:M500000").Value
    arrSource = Sheets("Ark1").Range("A3:H500000").Value

    Main = 1
    Search = 1

    For Main = 1 To UBound(arrSource, 1)

        For Search = 1 To UBound(arrData, 1)

            If arrSource(Main, 3) = arrData(Search, 1) And _
                arrSource(Main, 4) = arrData(Search, 2) And _
                arrSource(Main, 1) = arrData(Search, 3) And _
                arrSource(Main, 2) = arrData(Search, 4) And _
                arrSource(Main, 5) = arrData(Search, 5) _
            Then
                arrSource(Main, 8) = arrData(Search, 6)
                Exit For
            End If

        Next
    Next

    Sheets("Sheet2").Range("A3:H500000") = arrSource

End Sub

到目前为止,最快的方法是将两个表加载到一个数组中并执行内存循环。

这是永远的。我们说的是几小时而不是几分钟。

有没有什么方法可以提高速度? 或者我需要使用其他一些程序吗? (将其加载到数据库并使用SQL,使用Visual Studio与普通的VB.net,SSIS)

我希望这可以在VBA中完成,所以任何指针都会非常感激。

修改

散列5列键是否会提高速度,还是必须迭代的行的共享量会产生滞后?

4 个答案:

答案 0 :(得分:5)

比较两个列表的最快方法是根据公共密钥向Dictionary添加值。字典经过优化,可以搜索键,并且可以更快地返回基于键的值,然后可以迭代数组。

Sub DictionarySearch()
    Dim dict
    Dim key As String
    Dim x As Long
    Dim arrData() As Variant
    Dim arrSource As Variant

    Set dict = CreateObject("Scripting.Dictionary")

    arrData = Worksheets("Sheet1").Range("H3:M500000").Value
    arrSource = Worksheets("Ark1").Range("A3:H500000").Value

    For x = 1 To UBound(arrData, 1)
        key = arrData(x, 1) & ":" & arrData(x, 2) & ":" & arrData(x, 3) & ":" & arrData(x, 4) & ":" & arrData(x, 5)
        If Not dict.Exists(key) Then dict.Add key, arrData(x, 6)

    Next

    For x = 1 To UBound(arrSource, 1)
        key = arrSource(x, 3) & ":" & arrSource(x, 4) & ":" & arrSource(x, 1) & ":" & arrSource(x, 2) & ":" & arrSource(x, 5)
        If dict.Exists(key) Then arrSource(x, 8) = dict(key)
    Next

    Sheets("Sheet2").Range("A3:H500000") = arrSource
End Sub

答案 1 :(得分:2)

不是一个完整的答案,但值得一试的想法。在this answer of mine to my own question中,我使用了一些加速技巧,比如使用.Value2而不是默认属性(.Value)并指定 vbNullString 而不是零长度字符串(" ")找到匹配的数组元素,使Excel处理更少。也许您可以Heap's algorithm使用this answer,但我不太确定。

答案 2 :(得分:1)

欢迎来到性能改进的奇妙世界: - )

让我解释一下你在做什么: 您正在获取两个数据集,每个数据集包含500,000个条目。然后你循环遍历它们,如下所示:

for every member in dataset1 do
  for every member in dataset2 do
    if condition1 is met, and
    if condition2 is met, and
    if condition3 is met, and
    if condition4 is met, and
    if condition5 is met
    then do something
    end if-loop
  end for-loop (dataset2)
end for-loop (dataset1)

当您计算您正在执行的操作数量时,我们会看到以下内容:

500,000 runs through dataset1
500,000 runs through dataset2
5 (number of conditions to check)
=> 1250,000,000,000 actions, this is enormous!

最重要的是,你正在使用VBA:VBA是一种脚本语言,这意味着你到达这一行代码时,每一行代码都被翻译成机器语言(如果你要使用另一种语言,那么你可以编译,翻译成机器语言只会进行一次,这种机器语言将在之后执行)

如果您想继续使用VBA,我可以提供两条建议:

  1. 如果可以,请尝试使用已排序的数据集
  2. 尽量减少要检查的条件数
  3. 这将导致这种新算法:

    for every member in dataset1 do
      go in dataset2 from the start to the maximum, defined by the first for-loop, and do
        if condition1 is met, then:
          if condition2 is met, then:
            if condition3 is met, then:
              if condition4 is met, then:
                if condition5 is met
                then do something
                end if-loop
              end if-loop
            end if-loop
          end if-loop
        end if-loop
      end for-loop (dataset2)
    end for-loop (dataset1)
    

    这种工作方式可以减少计算机执行的操作量:

    500,000 runs through dataset1
    log(500,000) runs through dataset2 (it's only browsed until a certain limit)
    3 conditions (on average)
    => 500,000 * log(500,000) * 3 = 8,500,000 actions (on average), which is manageable
    

    我希望这对你有意义。在我看来,这里的主要问题是对数据集进行排序,这将大大提高您的性能!

答案 3 :(得分:0)

Excel需要评估的大量数据。 问题是,Excel是最好的解决方案吗,或者如果你用C ++或C#编写用于比较的应用程序会更好吗?因为它们会快得多。

但是如果您需要在VBA中执行此操作,则此代码可以帮助您。如果它们具有相同的数据,我总是使用它来比较2个范围。而且我从未对该代码有过速度问题,所以也许你可以看看它。

    Sub Start()
    Dim rng1 As Range
    Dim rng2 As Range
    Dim bolNotEqual As Boolean
    Set rng1 = Sheets("Sheet1").Range("H3:M500000").Value
    Set rng2 = arrSource = Sheets("Ark1").Range("A3:H500000").Value

    'Compare the Sheets if both are Equal
    Call CompareWorksheetRanges(rng1, rng2, bolNotEqual)

End Sub



Sub CompareWorksheetRanges(rng1 As Range, rng2 As Range, ByRef bol As Boolean)

Dim r As Long, c As Integer

Dim lr1 As Long, lr2 As Long, lc1 As Integer, lc2 As Integer
Dim maxR As Long, maxC As Integer, cf1 As String, cf2 As String

Dim rptWB As Workbook, DiffCount As Long

'If one rng is Empty Exit sup
If rng1 Is Nothing Or rng2 Is Nothing Then Exit Sub


    If rng1.Areas.Count > 1 Or rng2.Areas.Count > 1 Then
        MsgBox "Can't compare multiple selections!", _
            vbExclamation, "Compare Worksheet Ranges"
        Exit Sub
    End If

    Application.StatusBar = "Creating the report..."
    'Testing if the Ranges have the Same sice
    Set rptWB = Workbooks.Add

    With rng1
        lr1 = .Rows.Count
        lc1 = .Columns.Count
    End With
    With rng2
        lr2 = .Rows.Count
        lc2 = .Columns.Count
    End With
    maxR = lr1
    maxC = lc1
    If maxR < lr2 Then maxR = lr2
    If maxC < lc2 Then maxC = lc2
    If lr1 <> lr2 Or lc1 <> lc2 Then
        If MsgBox("The two ranges you want to compare are of different size!" & _
            Chr(13) & "Do you want to continue anyway?", _
            vbQuestion + vbYesNo, "Compare Worksheet Ranges") = vbNo Then Exit Sub
    End If

    'End Testing sice




     DiffCount = 0
'Compare the Ranges if same Value
        For c = 1 To maxC
            Application.StatusBar = "Comparing cells " & _
                Format(c / maxC, "0 %") & "..."
            For r = 1 To maxR
                cf1 = ""
                cf2 = ""
                On Error Resume Next
                cf1 = rng1.Cells(r, c).FormulaLocal
                cf2 = rng2.Cells(r, c).FormulaLocal
                On Error GoTo 0
                If cf1 <> cf2 Then
                    DiffCount = DiffCount + 1
                    Cells(r, c).Formula = "'" & cf1 & " <> " & cf2
                End If
            Next r
        Next c

        Application.StatusBar = "Formatting the report..."



        rptWB.Close False

        Set rptWB = Nothing

        If DiffCount = 0 Then
            bol = False
        Else
            bol = True

        End If
        Application.StatusBar = False
        Application.ScreenUpdating = True

    End Sub