Question

我有一个删除重复项的程序，一切正常。它只是冻结大数据集，即1到250万字。

我的做法有什么问题？还有更好的吗？

Sub DeleteDuplicateParagraphs()

Dim p1 As Paragraph
Dim p2 As Paragraph
Dim DupCount As Long
Dim StartTime As Double
Dim SecondsElapsed As Double

StartTime = Timer

For Each p1 In ActiveDocument.Paragraphs
  If p1.range.Text <> vbCr Then

    For Each p2 In ActiveDocument.Paragraphs
      If p1.range.Text = p2.range.Text Then
        DupCount = DupCount + 1
        If p1.range.Text = p2.range.Text And DupCount > 1 Then p2.range.Delete
      End If
    Next p2
  End If
  DupCount = 0
Next p1

SecondsElapsed = Round(Timer - StartTime, 2)

MsgBox "This code ran successfully in " & SecondsElapsed & " seconds", vbInformation
DupCount = 0

End Sub

Answer 1

尝试此操作（首先将Microsoft Scripting Runtime的引用添加到您的VBA项目中）：

Sub DeleteDuplicateParagraphs()
  Dim p As Paragraph
  Dim d As New Scripting.Dictionary
  Dim t As Variant
  Dim i As Integer
  Dim StartTime As Single

  StartTime = Timer

  ' collect duplicates
  For Each p In ActiveDocument.Paragraphs
    t = p.Range.Text
    If t <> vbCr Then
      If Not d.Exists(t) Then d.Add t, New Scripting.Dictionary
      d(t).Add d(t).Count + 1, p
    End If
  Next

  ' eliminate duplicates
  Application.ScreenUpdating = False
  For Each t In d
    For i = 2 To d(t).Count
      d(t)(i).Range.Delete
    Next
  Next
  Application.ScreenUpdating = True

  MsgBox "This code ran successfully in " & Round(Timer - StartTime, 2) & " seconds", vbInformation
End Sub

这利用了Scripting.Dictionary是一个哈希表的事实，该哈希表适用于非常快速地将唯一键与值相关联。因此，它非常擅长发现重复的密钥。字典键必须是字符串，方便我们使用段落文本。

对于值，我们使用更多的字典对象，仅仅因为它们比VBA的数组工作得更好。在其中，我们使用相同的文本收集对实际段落实例的引用。

事实上，删除重复的段落是一件非常简单的事情。

注意：上面代码中的重复检测部分非常快。但是，如果Word在大型文档中没有响应，那么它就在重复删除部分中，即因为Word的撤消缓冲区。

罪魁祸首是段落范围一个接一个地删除，导致Word构建一个非常大的撤消缓冲区。不幸的是，（我知道）无论如何都

在一个步骤中删除多个单独的范围（这将导致撤消缓冲区中只有一个条目），或
从VBA

定期在＆＃34中调用UndoClear;消除重复＆＃34;循环可能会有所帮助，禁用ScreenUpdating也不是一个坏主意：

' eliminate duplicates
Dim x As Integer

Application.ScreenUpdating = False
For Each t In d
  x = x + 1
  For i = 2 To d(t).Count
    d(t)(i).Range.Delete
  Next
  If x Mod 50 = 0 Then ActiveDocument.UndoClear
Next
ActiveDocument.UndoClear
Application.ScreenUpdating = True

Answer 2

首先，我非常感谢你花时间和精力帮助我。

你对这种方法背后的想法非常令人印象深刻。我确实略微更改了代码，并希望您在有空的时候仔细阅读它，看它是否符合最佳标准。再一次，我真的非常感谢你，代码比之前的代码运行了20次分割，甚至没有更大的数据集。

   >  Sub DeleteDuplicateParagraphs()
    > 
    >   Dim p As Paragraph 

>     Set d = CreateObject("Scripting.Dictionary")  
>         Dim t As Variant   
>     Dim i As Integer   
>     Dim StartTime As Single
    > 
    >   StartTime = Timer
    > 
    >   ' collect duplicates   For Each p In ActiveDocument.Paragraphs
    >     t = p.range.Text
    >     If t <> vbCr Then
    >       If Not d.Exists(t) Then d.Add t, CreateObject("Scripting.Dictionary")
    >       d(t).Add d(t).Count + 1, p
    >     End If   Next
    > 
    >   ' eliminate duplicates   For Each t In d
    >     For i = 2 To d(t).Count
    >       d(t)(i).range.Delete
    >     Next   Next
    > 
    >   MsgBox "This code ran successfully in " & Round(Timer - StartTime,
    > 2) & " seconds", vbInformation
    > 
    > End Sub

重复删除VBA Word无法有效工作

2 个答案: