我正在处理一个宏,它将通过电子表格删除重复的条目(行),这两个条件分别在两列(Q和D列)中提供。
这就是我所拥有的。我在一个小数据集上测试了它,它是慢。
Sub RemoveDupesKeepLast()
dim i As Integer
dim criteria1, criteria2 As String
Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual
Application.EnableEvents = False
'start at bottom of sheet, go up
For i = ActiveSheet.UsedRange.Rows.Count to 2 Step -1
'if there is no entry, go to next row
If Cells(i, "Q").Value = "" Then
GoTo gogo:
End If
'set criteria that we will filter for
criteria1 = Cells(i, "D").Value
criteria2 = Cells(i, "Q").Value
'filter for criteria2, then criteria1 to get duplicates
ActiveSheet.Range("A":"CI").AutoFilter field:=17, Criteria1:=criteria2, Operator:=xlFilterValues
ActiveSheet.Range("A":"CI").AutoFilter field:=4, Criteria1:=criteria1, Operator:=xlFilterValues
'if there are duplicates, keep deleting rows until only bottom-most entry is left behind
Do While Range("Q2", Cells(Rows.Count, "Q").End(xlUp)).Cells.SpecialCells(xlCellTypeVisible).Count > 1
ActiveSheet.AutoFilter.Range.Offset(1).SpecialCells(xlCellTypeVisible).Cells(1,17).EntireRow.Delete
Loop
'reset autofilter
If ActiveSheet.FilterMode Then
Cells.AutoFilter
End If
gogo:
Next i
Application.EnableEvents = True
Application.Calculation = xlCalculationAutomatic
Application.ScreenUpdating = True
End Sub
我是否有不同的方法可以解决这个问题以加快速度?就像现在一样,我基本上检查每一行,直到我到达顶部。这些床单实际上是从30,000行到最大值。在我看来,应该有一种更快,更清洁的方式来实现我想要做的事情,但我似乎无法想到一个。
答案 0 :(得分:1)
此过程删除D列和Q列标识的所有重复行。 在重复项中,它将使行保持最接近工作表的底部。 基本上,在右侧创建索引列以对底部的所有重复行进行排序和移动,以便可以在单个调用中删除它们。 请注意,如果存在,则不会更改单元格公式或格式。
Sub DeleteDuplicatedRows()
Dim rgTable As Range, rgIndex As Range, dataColD(), dataColQ()
Set rgTable = ActiveSheet.UsedRange
' load each column representing the identifier in an array
dataColD = rgTable.Columns("D").value ' load values from column D
dataColQ = rgTable.Columns("Q").value ' load values from column Q
' get each unique row number with a dictionary
Dim dict As New VBA.collection, indexes(), r&, rr
On Error Resume Next
For r = UBound(dataColD) To 1 Step -1
dict.Add r, dataColD(r, 1) & vbNullChar & dataColQ(r, 1)
Next
On Error GoTo 0
' index all the unique rows in an array
ReDim indexes(1 To UBound(dataColD), 1 To 1)
For Each rr In dict: indexes(rr, 1) = rr: Next
' insert the indexes in the last column on the right
Set rgIndex = rgTable.Columns(rgTable.Columns.count + 1)
rgIndex.value = indexes
' sort the rows on the indexes, duplicates will move at the end
Union(rgTable, rgIndex).Sort key1:=rgIndex, Orientation:=xlTopToBottom, Header:=xlYes
' delete the index column on the right and the empty rows at the bottom
rgIndex.EntireColumn.Delete
rgTable.Resize(UBound(dataColD) - dict.count + 1).offset(dict.count).EntireRow.Delete
End Sub
答案 1 :(得分:1)
100,000行×87列,40.3秒。
如果你的数据集从30K行开始并且只是变大,你应该尽可能地寻找内存处理¹。我已经调整了this solution中使用的方法,以更加严格地遵循您的要求。
以下批量将所有值加载到变量数组中,并从结果中构建Scripting.Dictionary对象。使用向字典添加键的'覆盖'方法,以便只保留最后一个。
执行排序规则后,值将返回到重新调整大小的变量数组,并还原到工作表 en masse 。
<强> Module1 (Code) 强>
Option Explicit
Sub removeDupesKeepLast()
Dim d As Long, dDQs As Object, ky As Variant
Dim r As Long, c As Long, vVALs As Variant, vTMP As Variant
'appTGGL bTGGL:=False 'uncomment this when you have finished debugging
Set dDQs = CreateObject("Scripting.Dictionary")
dDQs.comparemode = vbTextCompare
'step 1 - bulk load the values
With Worksheets("Sheet1") 'you should know what worksheet you are on
With .Cells(1, 1).CurrentRegion 'block of data radiating out from A1
With .Resize(.Rows.Count - 1, .Columns.Count).Offset(1, 0) 'step off the header row
vVALs = .Value 'use .Value2 if you do not have dates in unformatted cells
End With
End With
End With
'step 2 - build the dictionary
ReDim vTMP(UBound(vVALs, 2) - 1)
For r = LBound(vVALs, 1) To UBound(vVALs, 1)
For c = LBound(vVALs, 2) To UBound(vVALs, 2)
vTMP(c - 1) = vVALs(r, c)
Next c
dDQs.Item(vVALs(r, 4) & ChrW(8203) & vVALs(r, 17)) = vTMP
Next r
'step 3 - put the de-duplicated values back into the array
r = 0
ReDim vVALs(1 To dDQs.Count, LBound(vVALs, 2) To UBound(vVALs, 2))
For Each ky In dDQs
r = r + 1
vTMP = dDQs.Item(ky)
For c = LBound(vTMP) To UBound(vTMP)
vVALs(r, c + 1) = vTMP(c)
Next c
Next ky
'step 4 - clear the destination; put the de-duplicated values back into the worksheet and reset .UsedRange
With Worksheets("Sheet1") 'you should know what worksheet you are on
With .Cells(1, 1).CurrentRegion 'block of data radiating out from A1
With .Resize(.Rows.Count - 1, .Columns.Count).Offset(1, 0) 'step off the header row
.ClearContents 'retain formatting if it is there
.Cells(1, 1).Resize(UBound(vVALs, 1), UBound(vVALs, 2)) = vVALs
End With
End With
.UsedRange 'assert the UsedRange property (refreshes it)
End With
dDQs.RemoveAll: Set dDQs = Nothing
appTGGL
End Sub
Public Sub appTGGL(Optional bTGGL As Boolean = True)
With Application
.ScreenUpdating = bTGGL
.EnableEvents = bTGGL
.DisplayAlerts = bTGGL
.AutoRecover.Enabled = bTGGL 'no interruptions with an auto-save
.Calculation = IIf(bTGGL, xlCalculationAutomatic, xlCalculationManual)
.CutCopyMode = False
.StatusBar = vbNullString
End With
Debug.Print Timer
End Sub
我的样本工作簿占用了100K行×87列,重复约24%,并在~40秒内处理了所有重复项(保留最后的条目)。以上回写到Sheet1;我的测试运行写回Sheet2以保留原始数据。如果您选择回写到其他工作表,请确保从A1开始有一些值,以便可以正确识别Range.CurrentRegion property。测试机器是运行32位Excel 2010的旧笔记本电脑;你自己的结果可能会有所不同。
¹有关Excel中处理大型数据集的提示,请参阅Highlight Duplicates and Filter by color alternative。