Excel - 字符串删除重复项

时间:2016-03-10 09:23:00

标签: regex string excel vba duplicates

我正在使用一些英国地址数据,这些数据在Excel单元格中用逗号分成其组成部分。

我有一些VBA,我从网上删除了一些精确重复的条目,但是我留下了大量的数据,这些数据有一些顺序的重复段和一些非顺序的。

附件是突出我想要实现的图像,到目前为止我使用的代码不是我的代码,用于向您显示我一直在寻找的方向。任何人都有进一步的想法如何实现这一目标?

Function stringOfUniques(inputString As String, delimiter As String)
Dim xVal As Variant
Dim dict As Object
Set dict = CreateObject("Scripting.Dictionary")

For Each xVal In Split(inputString, delimiter)
dict(xVal) = xVal
Next xVal

stringOfUniques = Join(dict.Keys(), ",")
End Function

这确实设法摆脱了其中的一些,但是我正在努力开发一个庞大的人口,因此自动化这将是不可思议的。

Ideal Outcome

3 个答案:

答案 0 :(得分:4)

可能不是最优雅的答案,但这可以解决问题。 这里我使用Split命令在每个逗号分割字符串。 从此返回的结果是

bat ball banana

代码:

Option Explicit
Private Sub test()
 Dim Mystring As String
 Dim StrResult As String

 Mystring = "bat,ball,bat,ball,banana"
 StrResult = shed_duplicates(Mystring)
End Sub
Private Function shed_duplicates(ByRef Mystring As String) As String
 Dim MySplitz() As String
 Dim J As Integer
 Dim K As Integer
 Dim BooMatch As Boolean
 Dim StrTemp(10) As String ' assumes no more than 10 possible splits!
 Dim StrResult As String


 MySplitz = Split(Mystring, ",")
  For J = 0 To UBound(MySplitz)
     BooMatch = False
     For K = 0 To UBound(StrTemp)
         If MySplitz(J) = StrTemp(K) Then
            BooMatch = True
            Exit For
         End If
     Next K
    If Not BooMatch Then
       StrTemp(J) = MySplitz(J)
    End If
Next
For J = 0 To UBound(StrTemp)
   If Len(StrTemp(J)) > 0 Then ' ignore blank entries
      StrResult = StrResult + StrTemp(J) + " "
   End If
Next J
Debug.Print StrResult
End Function

答案 1 :(得分:3)

您可能真的使用正则表达式替换:

^(\d*\s*([^,]*),.*)\2(,|$)

替换模式是

$1$3

请参阅regex demo模式说明

  • ^ - 字符串的开头(或.MultiLine = True的行
  • (\d*\s*([^,]*),.*) - 第1组(后来引用替换模式的$1反向引用)匹配:
    • \d* - 0+位后跟
    • \s* - 0+空格字符
    • ([^,]*) - 第2组(稍后我们可以使用\2模式内反向引用来引用此子模式捕获的值)匹配除逗号以外的0+个字符
    • ,.* - 一个逗号,后跟除了换行符之外的0 +个字符
  • \2 - 第2组捕获的文字
  • (,|$) - 第3组(后来用替换模式中的$3引用 - 恢复逗号)匹配逗号或字符串结尾(或.MultiLine = True行)

注意:如果您只检查包含一个地址的单个单元格,则不需要.MultiLine = True

下面是一个示例VBA Sub,展示了如何在VBA中使用它:

Sub test()
  Dim regEx As Object
  Set regEx = CreateObject("VBScript.RegExp")
  With regEx
      .pattern = "^(\d*\s*([^,]*),.*)\2(,|$)"
      .Global = True
      .MultiLine = True ' Remove if individual addresses are matched
  End With
  s = "66 LAUSANNE ROAD,LAUSANNE ROAD,HORNSEY" & vbCrLf & _
      "9 CARNELL LANE,CARNELL LANE,FERNWOOD" & vbCrLf & _
      "35 FLAT ANDERSON HEIGHTS,1001 LONDON ROAD,FLAT ANDERSON HEIGHTS" & vbCrLf & _
      "27 RUSSELL BANK ROAD,RUSSEL BANK,SUTTON COLDFIELD"
  MsgBox regEx.Replace(s, "$1$3")
End Sub

enter image description here

答案 2 :(得分:1)

第一种解决方案是使用字典来获取唯一段的列表。 然后就像在拆分段之前跳过第一个地址号一样简单:

Function RemoveDuplicates1(text As String) As String
  Static dict As Object
  If dict Is Nothing Then
    Set dict = CreateObject("Scripting.Dictionary")
    dict.CompareMode = 1  ' set the case sensitivity to All
  Else
    dict.RemoveAll
  End If

  ' Get the position just after the address number
  Dim c&, istart&, segment
  For istart = 1 To Len(text)
    c = Asc(Mid$(text, istart, 1))
    If (c < 48 Or c > 57) And c <> 32 Then Exit For  ' if not [0-9 ]
  Next

  ' Split the segments and add each one of them to the dictionary. No need to keep 
  ' a reference to each segment since the keys are returned by order of insertion.
  For Each segment In Split(Mid$(text, istart), ",")
    If Len(segment) Then dict(segment) = Empty
  Next

  ' Return the address number and the segments by joining the keys
  RemoveDuplicates1 = Mid$(text, 1, istart - 1) & Join(dict.keys(), ",")
End Function

第二个解决方案是提取所有段,然后搜索它们中的每一个是否存在于先前位置:

Function RemoveDuplicates2(text As String) As String
  Dim c&, segments$, segment$, length&, ifirst&, istart&, iend&

  ' Get the position just after the address number
  For ifirst = 1 To Len(text)
    c = Asc(Mid$(text, ifirst, 1))
    If (c < 48 Or c > 57) And c <> 32 Then Exit For  ' if not [0-9 ]
  Next

  ' Get the segments without the address number and add a leading/trailing comma
  segments = "," & Mid$(text, ifirst) & ","
  istart = 1

  ' iterate each segment
  Do While istart < Len(segments)

    ' Get the next segment position
    iend = InStr(istart + 1, segments, ",") - 1 And &HFFFFFF
    If iend - istart Then

      ' Get the segment
      segment = Mid$(segments, istart, iend - istart + 2)

      ' Rewrite the segment if not present at a previous position
      If InStr(1, segments, segment, vbTextCompare) = istart Then
        Mid$(segments, length + 1) = segment
        length = length + Len(segment) - 1
      End If
    End If

    istart = iend + 1
  Loop

  ' Return the address number and the segments
  RemoveDuplicates2 = Mid$(text, 1, ifirst - 1) & Mid$(segments, 2, length - 1)

End Function

第三种解决方案是使用正则表达式删除所有重复的段:

Function RemoveDuplicates3(ByVal text As String) As String

  Static re As Object
  If re Is Nothing Then
    Set re = CreateObject("VBScript.RegExp")
    re.Global = True
    re.IgnoreCase = True
    ' Match any duplicated segment separated by a comma.
    ' The first segment is compared without the first digits.
    re.Pattern = "((^\d* *|,)([^,]+)(?=,).*),\3?(?=,|$)"
  End If

  ' Remove each matching segment
  Do While re.test(text)
    text = re.Replace(text, "$1")
  Loop

  RemoveDuplicates3 = text
End Function

这些是10000次迭代的执行时间(越低越好):

input text  : "123 abc,,1 abc,abc 2,ABC,abc,a,c"
output text : "123 abc,1 abc,abc 2,a,c"

RemoveDuplicates1 (dictionary)  : 718 ms
RemoveDuplicates2 (text search) : 219 ms
RemoveDuplicates3 (regex)       : 1469 ms