Excel - 识别唯一值模式并按列降序返回输出,针对500,000多行进行了优化

时间:2017-07-11 22:37:06

标签: excel list excel-vba pattern-matching unique vba

这是我已经工作了一年多的大规模数据清理任务的第三个也是最后一个问题。感谢Stack Overflow社区帮助解决问题:

问题1- Index multiple columns and Match distinct values...
问题2- Count unique values that match ID, optimized for 100,000+ cases

我不能100%确定excel是否可以实现以下目标,但我会尽力描述我面临的数据清理和组织挑战。

我有一系列数据标记/属性,这些标记/属性在24列中以随机顺序排列,跨越500,000多行。下面的图1是原始形式的数据的示例,跨越12列呈现并跨越22行以简化说明。列A到L包含原始数据,列M到X表示所需的输出。

Image 1:

任务摘要:需要完成的是一系列匹配函数,它们搜索所有索引列(在本例中为A列到L列)以标识唯一值(例如1),搜索范围内的值(在本例中为A2:L21范围),识别唯一值的相邻值(对于值1,相邻值为2和13-XR),然后以最常发生的降序输出它们在包含任何有问题的值的每一行中最不经常出现的值(在这种情况下,1发生5次,放在M2到M6; 2发生3次,放在N2到N6; 13-XR发生2次,置于O2至O6)。

为了澄清,下面是一个逐步描述,使用颜色来说明原始数据(列A到L)中的模式匹配以及这些模式应该如何在输出中呈现(列M到X)。我已将以下每个图像分成原始数据中的六种模式。

Image 2:

上面的图像是VBA解决方案将识别的第一个模式。它会识别" 1"作为一个唯一值,并通过A:L范围搜索" 1" (以蓝色突出显示),然后识别在同一行中可以找到的所有值:" 2"在第3,第5和第6行(以绿色突出显示);和" 13-XR"在第4行和第5行(以粉色突出显示)。然后需要对" 2"进行识别相邻值(" 1"" 13-XR"),然后对" 13-XR",识别(" 1"和" 2"作为相邻值)。输出将返回M列中最常出现的唯一值(" 1"出现5次),然后第二次出现在列N中(" 2"出现3次) ,第三个出现在O列(" 13-XR"出现2次)。

Image 3:

以上情况稍微复杂一些。 VBA将识别" 3"作为唯一值,搜索A:L范围以用于" 3"并识别与其相邻的所有值(在这种情况下," 4"," 7"和" 9")。然后它会对" 4"执行相同的操作,识别所有相邻的值(仅#34; 3");然后为" 7",识别相邻的值(" 9"," 3"和" 12");那么" 9"识别(" 7"和" 3");最后,对于" 12"识别相邻值(仅#34; 7")。然后对于存在任何这些值的每一行,输出将返回" 3"在M列(发生三次)和一次" 7"在N栏(也出现三次);如果计数相等,它们可以以递增的方式呈现给A或Z或从最小到最大...或者只是随机的,对于我的目的,相等计数的排序是任意的。 " 9"将在O列中返回,因为它出现两次,然后" 4"在P栏和" 12"在Q列中,因为它们都出现一次,但是12次大于4.

Image 4:

上面的图像表示可能是常见事件,其中只有一个唯一值。在这里," 5"未在范围内的任何其他列中标识。因此返回为" 5"在M列中,每一行都有一个" 5"在场。

Image 5:

这将是另一种更常见的事件,其中一行中可能存在一个值而另一行中存在两个值。在这种情况下" 6"仅在该范围内识别一次并且" 8"是找到的唯一相邻值。什么时候" 8"搜索它只返回一个相邻值的实例" 6"。在这里," 8"发生两次" 6"只有一次,从而导致" 8"在M栏和" 6"无论在哪里" 8"或者" 6"在行中出现。

Image 6:

此处" 10"," 111"," 112"," 543"," 433",& #34; 444"和" 42-FG"被识别为在A:L范围内彼此相关的唯一值。除" 10"之外的所有值发生两次,在M到S列中按降序返回。

Image 7:

这个最终模式以与上面相同的方式识别,只有更多的唯一值(n = 10)。

最后注意事项:我不知道如何在Excel中完成此任务,但我希望其他人知道如何推动这一问题。以下是有关可能有助于解决问题的数据的其他说明:

  • 第一列将始终按升序排序。如果它简化了事情,我可以做其他的自定义排序。
  • 在~500,000行中,15%只有一个属性值(A列中的一个值),30%有两个属性值(col A中有1个值,col B中有1个值),13%有三个属性值属性值(col A,B和C中的1个值)。
  • 我在这个例子中提供了一些小数字。每个单元格中的实际原始数据值将接近20个字符。
  • 除了以降序呈现模式之外的所有事情的解决方案绝对很酷。排序会很好,但如果它造成太多麻烦,我可以没有它。

如果此说明中的任何内容需要进一步说明,或者我可以提供其他信息,请告知我们,我会根据需要进行调整。

提前感谢能够帮助我解决最终挑战的任何人。

附录:

完整数据集发生内存错误。 @ambie发现错误的来源是1000s内相邻的链(结果)编号(试图在1000列中返回结果)。似乎问题不在于解决方案或数据,只是在excel中达到了限制。一个可能的解决方案是(见下图)添加两个新列(ATT_COUNT作为列M; ATT_ALL作为列Z)。 M列中的ATT_COUNT将返回通常在列之间返回的唯一值的总数。在N到Y列(ATT_1_CL到ATT_12_CL)中仅返回最常出现的前12个值。要绕过ATT_COUNT>的实例。 12(以及1000以上),我们可以在ATT_ALL(列Z)中以空格分隔格式返回所有唯一值。例如,在下图中,行17,18,19和21在链中具有17个唯一值。只有前12个最常出现的值显示在N到Y列中。所有17个值在Z列中以空格分隔格式显示。

image 8

这是a link to this mini example test data

这是a link to a mid sized sample of test data of ~50k rows

这是a link to the full sized sample test data of ~500k rows

1 个答案:

答案 0 :(得分:2)

我们通常不提供'代码供您服务',但我知道在之前的问题中您提供了一些您尝试过的示例代码,我可以看到您不知道从哪里开始。

对于未来的编码工作,诀窍是将问题分解为单个任务。对于您的问题,这些将是:

  1. 识别所有唯一值并获取所有相邻值的列表 - 非常简单。
  2. 创建一个'链'列表,将一个相邻值链接到下一个 - 这更加尴尬,因为尽管列表显示已排序,但相邻值不是,因此列表中相对较低的值可能与一个更高的值已经是链的一部分(样本中的3就是一个例子)。因此,最简单的方法是在读取所有唯一值后才分配链。
  3. 将每个唯一值映射到其相应的“链” - 我通过为链创建索引并将相关值分配给唯一值来完成此操作。
  4. Collection对象非常适合您,因为它们处理重复的问题,允许您填充未知大小的列表,并使用Key属性轻松进行值映射。为了使编码易于阅读,我创建了一个包含一些字段的类。首先,插入一个类模块并将其命名为 cItem 。这个类背后的代码是:

    Option Explicit
    
    Public Element As String
    Public Frq As Long
    Public AdjIndex As Long
    Public Adjs As Collection
    
    Private Sub Class_Initialize()
        Set Adjs = New Collection
    End Sub
    

    在您的模块中,任务可以编码如下:

    Dim data As Variant, adj As Variant
    Dim uniques As Collection, chains As Collection, chain As Collection
    Dim oItem As cItem, oAdj As cItem
    Dim r As Long, c As Long, n As Long, i As Long, maxChain As Long
    Dim output() As Variant
    
    'Read the data.
    'Note: Define range as you need.
    With Sheet1
        data = .Range(.Cells(2, "A"), _
                      .Cells(.Rows.Count, "A").End(xlUp)) _
               .Resize(, 12) _
               .Value2
    End With
    
    'Find the unique values
    Set uniques = New Collection
    For r = 1 To UBound(data, 1)
        For c = 1 To UBound(data, 2)
            If IsEmpty(data(r, c)) Then Exit For
            Set oItem = Nothing: On Error Resume Next
            Set oItem = uniques(CStr(data(r, c))): On Error GoTo 0
            If oItem Is Nothing Then
                Set oItem = New cItem
                oItem.Element = CStr(data(r, c))
                uniques.Add oItem, oItem.Element
            End If
            oItem.Frq = oItem.Frq + 1
            'Find the left adjacent value
            If c > 1 Then
                On Error Resume Next
                oItem.Adjs.Add uniques(CStr(data(r, c - 1))), CStr(data(r, c - 1))
                On Error GoTo 0
            End If
            'Find the right adjacent value
            If c < UBound(data, 2) Then
                If Not IsEmpty(data(r, c + 1)) Then
                    On Error Resume Next
                    oItem.Adjs.Add uniques(CStr(data(r, c + 1))), CStr(data(r, c + 1))
                    On Error GoTo 0
                End If
            End If
        Next
    Next
    
    'Define the adjacent indexes.
    For Each oItem In uniques
        'If the item has a chain index, pass it to the adjacents.
        If oItem.AdjIndex <> 0 Then
            For Each oAdj In oItem.Adjs
                oAdj.AdjIndex = oItem.AdjIndex
            Next
        Else
            'If an adjacent has a chain index, pass it to the item.
            i = 0
            For Each oAdj In oItem.Adjs
                If oAdj.AdjIndex <> 0 Then
                    i = oAdj.AdjIndex
                    Exit For
                End If
            Next
            If i <> 0 Then
                oItem.AdjIndex = i
                For Each oAdj In oItem.Adjs
                    oAdj.AdjIndex = i
                Next
            End If
            'If we're still missing a chain index, create a new one.
            If oItem.AdjIndex = 0 Then
                n = n + 1
                oItem.AdjIndex = n
                For Each oAdj In oItem.Adjs
                    oAdj.AdjIndex = n
                Next
            End If
        End If
    Next
    
    'Populate the chain lists.
    Set chains = New Collection
    For Each oItem In uniques
        Set chain = Nothing: On Error Resume Next
        Set chain = chains(CStr(oItem.AdjIndex)): On Error GoTo 0
        If chain Is Nothing Then
            'It's a new chain so create a new collection.
            Set chain = New Collection
            chain.Add oItem.Element, CStr(oItem.Element)
            chains.Add chain, CStr(oItem.AdjIndex)
        Else
            'It's an existing chain, so find the frequency position (highest first).
            Set oAdj = uniques(chain(chain.Count))
            If oItem.Frq <= oAdj.Frq Then
                chain.Add oItem.Element, CStr(oItem.Element)
            Else
                For Each adj In chain
                    Set oAdj = uniques(adj)
                    If oItem.Frq > oAdj.Frq Then
                        chain.Add Item:=oItem.Element, Key:=CStr(oItem.Element), Before:=adj
                        Exit For
                    End If
                Next
            End If
        End If
        'Get the column count of output array
        If chain.Count > maxChain Then maxChain = chain.Count
    Next
    
    'Populate each row with the relevant chain
    ReDim output(1 To UBound(data, 1), 1 To maxChain)
    For r = 1 To UBound(data, 1)
        Set oItem = uniques(CStr(data(r, 1)))
        Set chain = chains(CStr(oItem.AdjIndex))
        c = 1
        For Each adj In chain
            output(r, c) = adj
            c = c + 1
        Next
    Next
    
    'Write the output to sheet.
    'Note: adjust range to suit.
    Sheet1.Range("M2").Resize(UBound(output, 1), UBound(output, 2)).Value = output
    

    这不是最有效的方法,但它确实使每项任务对您更加明显。我不确定我是否理解了数据结构的完整复杂性,但上面的代码确实重现了您的示例,因此它应该为您提供一些工作。

    <强>更新

    好的,现在我已经看到了你的评论和实际数据,下面是一些修改后的代码应该更快,并处理显然“空”单元实际上是空字符串的事实。

    首先创建一个名为 cItem 的类,然后添加代码:

    Option Explicit
    
    Public Name As String
    Public Frq As Long
    Public Adj As Collection
    Private mChainIndex As Long
    Public Property Get ChainIndex() As Long
        ChainIndex = mChainIndex
    End Property
    Public Property Let ChainIndex(val As Long)
        Dim oItem As cItem
        If mChainIndex = 0 Then
            mChainIndex = val
            For Each oItem In Me.Adj
                oItem.ChainIndex = val
            Next
        End If            
    End Property
    Public Sub AddAdj(oAdj As cItem)
        Dim t As cItem
    
        On Error Resume Next
        Set t = Me.Adj(oAdj.Name)
        On Error GoTo 0
        If t Is Nothing Then Me.Adj.Add oAdj, oAdj.Name
    End Sub
    Private Sub Class_Initialize()
        Set Adj = New Collection
    End Sub
    

    现在创建另一个名为 cChain 的类,代码隐藏为:

    Option Explicit
    
    Public Index As Long
    Public Members As Collection
    Public Sub AddItem(oItem As cItem)
        Dim oChainItem As cItem
        With Me.Members
            Select Case .Count
                Case 0 'First item so just add it.
                    .Add oItem, oItem.Name
                Case Is < 12 'Fewer than 12 items, so add to end or in order.
                    Set oChainItem = .item(.Count)
                    If oItem.Frq <= oChainItem.Frq Then 'It's last in order so just add it.
                        .Add oItem, oItem.Name
                    Else 'Find its place in order.
                        For Each oChainItem In Me.Members
                            If oItem.Frq > oChainItem.Frq Then
                                .Add oItem, oItem.Name, before:=oChainItem.Name
                                Exit For
                            End If
                        Next
                    End If
                Case 12 'Full list, so find place and remove last item.
                    Set oChainItem = .item(12)
                    If oItem.Frq > oChainItem.Frq Then
                        For Each oChainItem In Me.Members
                            If oItem.Frq > oChainItem.Frq Then
                                .Add oItem, oItem.Name, before:=oChainItem.Name
                                .Remove 13
                                Exit For
                            End If
                        Next
                    End If
            End Select
        End With
    End Sub
    Private Sub Class_Initialize()
        Set Members = New Collection
    End Sub
    

    最后,您的模块代码将是:

    Option Explicit
    
    Public Sub ProcessSheet()
        Dim data As Variant
        Dim items As Collection, chains As Collection
        Dim oItem As cItem, oAdj As cItem
        Dim oChain As cChain
        Dim txt As String
        Dim r As Long, c As Long, n As Long
        Dim output() As Variant
        Dim pTick As Long, pCount As Long, pTot As Long, pTask As String
    
        'Read the data.
        pTask = "Reading data..."
        Application.StatusBar = pTask
        With Sheet1
            data = .Range(.Cells(2, "A"), _
                          .Cells(.Rows.Count, "A").End(xlUp)) _
                   .Resize(, 12) _
                   .Value2
        End With
    
        'Collect unique and adjacent values.
        pTask = "Finding uniques "
        pCount = 0: pTot = UBound(data, 1): pTick = 0
        Set items = New Collection
        For r = 1 To UBound(data, 1)
            If ProgressTicked(pTot, pCount, pTick) Then
                Application.StatusBar = pTask & pTick & "%"
                DoEvents
            End If
            For c = 1 To UBound(data, 2)
                txt = data(r, c)
                If Len(txt) = 0 Then Exit For
                Set oItem = GetOrCreateItem(items, txt)
                oItem.Frq = oItem.Frq + 1
    
                'Take adjacent on left.
                If c > 1 Then
                    txt = data(r, c - 1)
                    If Len(txt) > 0 Then
                        Set oAdj = GetOrCreateItem(items, txt)
                        oItem.AddAdj oAdj
                    End If
                End If
                'Take adjacent on right.
                If c < UBound(data, 2) Then
                    txt = data(r, c + 1)
                    If Len(txt) > 0 Then
                        Set oAdj = GetOrCreateItem(items, txt)
                        oItem.AddAdj oAdj
                    End If
                End If
    
            Next
        Next
    
        'Now that we have all the items and their frequencies,
        'we can find the adjacent chain indexes by a recursive
        'call of the ChainIndex set property.
        pTask = "Find chain indexes "
        pCount = 0: pTot = items.Count: pTick = 0
        Set chains = New Collection
        n = 1 'Chain index.
        For Each oItem In items
            If ProgressTicked(pTot, pCount, pTick) Then
                Application.StatusBar = pTask & pTick & "%"
                DoEvents
            End If
            If oItem.ChainIndex = 0 Then
                oItem.ChainIndex = n
                Set oChain = New cChain
                oChain.Index = n
                chains.Add oChain, CStr(n)
                n = n + 1
            End If
        Next
    
        'Build the chains.
        pTask = "Build chains "
        pCount = 0: pTot = items.Count: pTick = 0
        For Each oItem In items
            If ProgressTicked(pTot, pCount, pTick) Then
                Application.StatusBar = pTask & pTick & "%"
                DoEvents
            End If
            Set oChain = chains(CStr(oItem.ChainIndex))
            oChain.AddItem oItem
        Next
    
        'Write the data to our output array.
        pTask = "Populate output "
        pCount = 0: pTot = UBound(data, 1): pTick = 0
        ReDim output(1 To UBound(data, 1), 1 To 12)
        For r = 1 To UBound(data, 1)
            If ProgressTicked(pTot, pCount, pTick) Then
                Application.StatusBar = pTask & pTick & "%"
                DoEvents
            End If
            Set oItem = items(data(r, 1))
            Set oChain = chains(CStr(oItem.ChainIndex))
            c = 1
            For Each oItem In oChain.Members
                output(r, c) = oItem.Name
                c = c + 1
            Next
        Next
    
        'Write the output to sheet.
        'Note: adjust range to suit.
        pTask = "Writing data..."
        Application.StatusBar = pTask
        Sheet1.Range("M2").Resize(UBound(output, 1), UBound(output, 2)).Value = output
        Application.StatusBar = "Ready"
    End Sub
    
    Private Function GetOrCreateItem(col As Collection, key As String) As cItem
        Dim obj As cItem
    
        'If the item already exists then return it,
        'otherwise create a new item.
        On Error Resume Next
        Set obj = col(key)
        On Error GoTo 0
    
        If obj Is Nothing Then
            Set obj = New cItem
            obj.Name = key
            col.Add obj, key
        End If
    
        Set GetOrCreateItem = obj
    
    End Function
    Public Function ProgressTicked(ByVal t As Long, ByRef c As Long, ByRef p As Long) As Boolean
        c = c + 1
        If Int((c / t) * 100) > p Then
            p = p + 1
            ProgressTicked = True
        End If
    End Function