大数据集的唯一计数公式

时间:2015-08-20 18:24:09

标签: excel vba unique uniqueidentifier distinct-values

我无法确定在相邻单元格中输入10以指示在处理大型数据集时值是否唯一的方法。我已经阅读了多种方法来实现这一点,但是对于我的目的来说,它们似乎都没有效率:我使用的是Excel 2010实例(所以我拥有 Distinct Count

在这个StackOverflow问题中:Simple Pivot Table to Count Unique Values有使用SUMPRODUCTCOUNTIF的建议,但是当我处理50,000多行时,这会导致糟糕的性能和文件大小~35 MB而不是~3 MB。我想知道对于大型动态数据集是否有更好的解决方案,无论是公式还是VBA。

我想要完成的一个例子是( Unique 列是相邻的单元格):

Name   Week   Unique
John   1      1
Sally  1      1
John   1      0
Sally  2      1

我尝试编写COUNTIF的相同功能,但没有成功:

For Each Cell In ThisWorkbook.Worksheets("Overtime & Type Data").Range("Z2:Z" & DataLastRow)
If Worksheets("Overtime & Type Data").Cells(Cell.Row, 26) <> Worksheets("Overtime & Type Data").Cells(Cell.Row - 1, 26) Then
FirstCell = Cell.Row
End If
If (Worksheets("Overtime & Type Data").Range(Cells(FirstCell, 26), Cells(Cell.Row, 26)) = Worksheets("Overtime & Type Data").Range(Cells(Cell.Row, 26))) = True Then
    Cell.Value = 1
Else
    Cell.Value = 0
End If
Next Cell

4 个答案:

答案 0 :(得分:2)

此代码在不到3秒的时间内成功运行了超过130,000行。调整列字母以适合您的数据集。

Sub tgr()

    Const colName As String = "A"
    Const colWeek As String = "B"
    Const colOutput As String = "C"

    Dim ws As Worksheet
    Dim rngData As Range
    Dim DataCell As Range
    Dim rngFound As Range
    Dim collUniques As Collection
    Dim arrResults() As Long
    Dim ResultIndex As Long
    Dim UnqCount As Long

    Set ws = ThisWorkbook.Sheets("Overtime & Type Data")
    Set rngData = ws.Range(colName & 2, ws.Cells(Rows.Count, colName).End(xlUp))
    Set collUniques = New Collection
    ReDim arrResults(1 To rngData.Cells.Count, 1 To 1)

    On Error Resume Next
    For Each DataCell In rngData.Cells
        ResultIndex = ResultIndex + 1
        collUniques.Add ws.Cells(DataCell.Row, colName).Value & ws.Cells(DataCell.Row, colWeek).Value, ws.Cells(DataCell.Row, colName).Value & ws.Cells(DataCell.Row, colWeek).Value
        If collUniques.Count > UnqCount Then
            UnqCount = collUniques.Count
            arrResults(ResultIndex, 1) = 1
        Else
            arrResults(ResultIndex, 1) = 0
        End If
    Next DataCell
    On Error GoTo 0

    ws.Cells(rngData.Row, colOutput).Resize(rngData.Cells.Count).Value = arrResults

End Sub

答案 1 :(得分:0)

一种方法是按名称和周排序。然后,您可以通过与前一行进行比较来确定任何行的唯一性。

如果您需要保留订单,可以先写一列索引号(1,2,3,...)来跟踪订单。计算唯一后,按索引排序以恢复原始订单。

整个过程可以通过相对较少的步骤手动完成,也可以使用VBA自动完成。

答案 2 :(得分:0)

我不确定这对50000值的效果如何,但它在大约一秒钟内会达到~1500。

Sub unique()
    Dim myColl As New Collection
    Dim isDup As Boolean
    Dim myValue As String
    Dim r As Long

    On Error GoTo DuplicateValue
    For r = 1 To Sheet1.UsedRange.Rows.Count
        isDup = False
        'Combine the value of the 2 cells together
        ' and add that string to our collection
        'If it is already in the collection it errors
        myValue = Sheet1.Cells(r, 1).Value & Sheet1.Cells(r, 2).Value
        myColl.Add r, myValue
        If isDup Then
            Sheet1.Cells(r, 3).Value = "0"
        Else
            Sheet1.Cells(r, 3).Value = "1"
        End If
    Next
    On Error GoTo 0
    Exit Sub
DuplicateValue:
    'The value is already in the collection so put a 0
    isDup = True
    Resume Next
End Sub

答案 3 :(得分:0)

几乎任何批量操作都会击败涉及工作表单元格的循环。您可以通过在内存中执行所有计算来缩短时间,并在完成后仅将值返回到工作表 en masse

Sub is_a_dupe()
    Dim v As Long, vTMP As Variant, vUNQs As Variant, dUNQs As Object

    Debug.Print Timer
    On Error GoTo bm_Uh_Oh
    Set dUNQs = CreateObject("Scripting.Dictionary")

    With Worksheets("Sheet1")

        vTMP = .Range(.Cells(2, 1), .Cells(Rows.Count, 2).End(xlUp)).Value2
        ReDim vUNQs(1 To UBound(vTMP, 1), 1 To 1)

        For v = LBound(vTMP, 1) To UBound(vTMP, 1)
            If dUNQs.Exists(Join(Array(vTMP(v, 1), vTMP(v, 2)))) Then
                vUNQs(v, 1) = 0
            Else
                dUNQs.Add Key:=Join(Array(vTMP(v, 1), vTMP(v, 2))), _
                          Item:=vTMP(v, 2)
                vUNQs(v, 1) = 1
            End If
        Next v

        .Cells(2, 3).Resize(UBound(vUNQs, 1), 1) = vUNQs

    End With

    Debug.Print Timer

bm_Uh_Oh:
    dUNQs.RemoveAll
    Set dUNQs = Nothing
End Sub

以前的经验告诉我,各种各样的数据(以及硬件等)会影响流程的时间安排,但在随机抽样数据中,我收到了这些经过的时间。

  

50K记录..... 0.53秒
   130K记录...... 1.32秒
   500K记录.... 4.92秒