Question

我正在处理一个宏，它将通过电子表格删除重复的条目（行），这两个条件分别在两列（Q和D列）中提供。

这就是我所拥有的。我在一个小数据集上测试了它，它是慢。

Sub RemoveDupesKeepLast()
dim i As Integer
dim criteria1, criteria2 As String

Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual
Application.EnableEvents = False

'start at bottom of sheet, go up
For i = ActiveSheet.UsedRange.Rows.Count to 2 Step -1

    'if there is no entry, go to next row
    If Cells(i, "Q").Value = "" Then
        GoTo gogo:
    End If

    'set criteria that we will filter for
    criteria1 = Cells(i, "D").Value
    criteria2 = Cells(i, "Q").Value

    'filter for criteria2, then criteria1 to get duplicates
    ActiveSheet.Range("A":"CI").AutoFilter field:=17, Criteria1:=criteria2, Operator:=xlFilterValues
    ActiveSheet.Range("A":"CI").AutoFilter field:=4, Criteria1:=criteria1, Operator:=xlFilterValues

    'if there are duplicates, keep deleting rows until only bottom-most entry is left behind
    Do While Range("Q2", Cells(Rows.Count, "Q").End(xlUp)).Cells.SpecialCells(xlCellTypeVisible).Count > 1
        ActiveSheet.AutoFilter.Range.Offset(1).SpecialCells(xlCellTypeVisible).Cells(1,17).EntireRow.Delete
    Loop

    'reset autofilter
    If ActiveSheet.FilterMode Then
        Cells.AutoFilter
    End If

gogo:
Next i

Application.EnableEvents = True
Application.Calculation = xlCalculationAutomatic
Application.ScreenUpdating = True

End Sub

我是否有不同的方法可以解决这个问题以加快速度？就像现在一样，我基本上检查每一行，直到我到达顶部。这些床单实际上是从30,000行到最大值。在我看来，应该有一种更快，更清洁的方式来实现我想要做的事情，但我似乎无法想到一个。

Answer 1

此过程删除D列和Q列标识的所有重复行。在重复项中，它将使行保持最接近工作表的底部。基本上，在右侧创建索引列以对底部的所有重复行进行排序和移动，以便可以在单个调用中删除它们。请注意，如果存在，则不会更改单元格公式或格式。

Sub DeleteDuplicatedRows()
  Dim rgTable As Range, rgIndex As Range, dataColD(), dataColQ()

  Set rgTable = ActiveSheet.UsedRange

  ' load each column representing the identifier in an array
  dataColD = rgTable.Columns("D").value  ' load values from column D
  dataColQ = rgTable.Columns("Q").value  ' load values from column Q

  ' get each unique row number with a dictionary
  Dim dict As New VBA.collection, indexes(), r&, rr
  On Error Resume Next
  For r = UBound(dataColD) To 1 Step -1
    dict.Add r, dataColD(r, 1) & vbNullChar & dataColQ(r, 1)
  Next
  On Error GoTo 0

  ' index all the unique rows in an array
  ReDim indexes(1 To UBound(dataColD), 1 To 1)
  For Each rr In dict: indexes(rr, 1) = rr: Next

  ' insert the indexes in the last column on the right
  Set rgIndex = rgTable.Columns(rgTable.Columns.count + 1)
  rgIndex.value = indexes

  ' sort the rows on the indexes, duplicates will move at the end
  Union(rgTable, rgIndex).Sort key1:=rgIndex, Orientation:=xlTopToBottom, Header:=xlYes

  ' delete the index column on the right and the empty rows at the bottom
  rgIndex.EntireColumn.Delete
  rgTable.Resize(UBound(dataColD) - dict.count + 1).offset(dict.count).EntireRow.Delete

End Sub

Answer 2

100,000行×87列，40.3秒。

如果你的数据集从30K行开始并且只是变大，你应该尽可能地寻找内存处理¹。我已经调整了this solution中使用的方法，以更加严格地遵循您的要求。

以下批量将所有值加载到变量数组中，并从结果中构建Scripting.Dictionary对象。使用向字典添加键的'覆盖'方法，以便只保留最后一个。

执行排序规则后，值将返回到重新调整大小的变量数组，并还原到工作表 en masse 。

<强> Module1 (Code)

Option Explicit

Sub removeDupesKeepLast()
    Dim d As Long, dDQs As Object, ky As Variant
    Dim r As Long, c As Long, vVALs As Variant, vTMP As Variant

    'appTGGL bTGGL:=False   'uncomment this when you have finished debugging

    Set dDQs = CreateObject("Scripting.Dictionary")
    dDQs.comparemode = vbTextCompare

    'step 1 - bulk load the values
    With Worksheets("Sheet1")   'you should know what worksheet you are on
        With .Cells(1, 1).CurrentRegion 'block of data radiating out from A1
            With .Resize(.Rows.Count - 1, .Columns.Count).Offset(1, 0) 'step off the header row
                vVALs = .Value  'use .Value2 if you do not have dates in unformatted cells
            End With
        End With
    End With

    'step 2 - build the dictionary
    ReDim vTMP(UBound(vVALs, 2) - 1)
    For r = LBound(vVALs, 1) To UBound(vVALs, 1)
        For c = LBound(vVALs, 2) To UBound(vVALs, 2)
            vTMP(c - 1) = vVALs(r, c)
        Next c
        dDQs.Item(vVALs(r, 4) & ChrW(8203) & vVALs(r, 17)) = vTMP
    Next r

    'step 3 - put the de-duplicated values back into the array
    r = 0
    ReDim vVALs(1 To dDQs.Count, LBound(vVALs, 2) To UBound(vVALs, 2))
    For Each ky In dDQs
        r = r + 1
        vTMP = dDQs.Item(ky)
        For c = LBound(vTMP) To UBound(vTMP)
            vVALs(r, c + 1) = vTMP(c)
        Next c
    Next ky

    'step 4 - clear the destination; put the de-duplicated values back into the worksheet and reset .UsedRange
    With Worksheets("Sheet1")   'you should know what worksheet you are on
        With .Cells(1, 1).CurrentRegion 'block of data radiating out from A1
            With .Resize(.Rows.Count - 1, .Columns.Count).Offset(1, 0) 'step off the header row
                .ClearContents  'retain formatting if it is there
                .Cells(1, 1).Resize(UBound(vVALs, 1), UBound(vVALs, 2)) = vVALs
            End With
        End With
        .UsedRange   'assert the UsedRange property (refreshes it)
    End With

    dDQs.RemoveAll: Set dDQs = Nothing

    appTGGL
End Sub

Public Sub appTGGL(Optional bTGGL As Boolean = True)
    With Application
        .ScreenUpdating = bTGGL
        .EnableEvents = bTGGL
        .DisplayAlerts = bTGGL
        .AutoRecover.Enabled = bTGGL   'no interruptions with an auto-save
        .Calculation = IIf(bTGGL, xlCalculationAutomatic, xlCalculationManual)
        .CutCopyMode = False
        .StatusBar = vbNullString
    End With
    Debug.Print Timer
End Sub

我的样本工作簿占用了100K行×87列，重复约24％，并在~40秒内处理了所有重复项（保留最后的条目）。以上回写到Sheet1;我的测试运行写回Sheet2以保留原始数据。如果您选择回写到其他工作表，请确保从A1开始有一些值，以便可以正确识别Range.CurrentRegion property。测试机器是运行32位Excel 2010的旧笔记本电脑;你自己的结果可能会有所不同。

¹_{有关Excel中处理大型数据集的提示，请参阅Highlight Duplicates and Filter by color alternative。}

删除重复项，保留最后一项 - 优化

2 个答案: