Question

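I know how to get a new data frame (df) of TRUE and FALSE values that identifies duplicates, but I would like to get back essentially the same df with a new column that flags whether each row is a duplicate. The identifier can be appended to the original df or returned in an entirely new df.

Note that my df has more than 20 million records. Also note that my df has only two columns. One test looks for duplicates in a single column only. The other test looks for duplicates in the combination of columns.

Thanks.

Reproducible data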

lRow = Sheets("Source1").Cells(Sheets("Source1").Rows.Count, 1).End(xlUp).Row
lCol = Sheets("Source1").Cells(1, Sheets("Source1").Columns.Count).End(xlToLeft).Column

' Manipulate data, including new columns
Set PRange = Sheets("Source1").Cells(1, 1).Resize(lRow, lCol + 2)
Set PDest = Sheets("Source1 Summary")

Set PCache = ActiveWorkbook.PivotCaches.Create(SourceType:=xlDatabase, SourceData:= _
    PRange).CreatePivotTable(TableDestination:= _
    PDest.Cells(3, 1), TableName:="Source1Pivot")

' Set up the table with the data fields, all works perfectly

'For the new Pivot Table:
lRow = Sheets("Source2").Cells(Sheets("Source2").Rows.Count, 1).End(xlUp).Row
lCol = Sheets("Source2").Cells(1, Sheets("Source2").Columns.Count).End(xlToLeft).Column

Set PRange = Sheets("Source2").Cells(1, 1).Resize(lRow, lCol + 1)
Set PDest = Sheets("Source2 Summary")

Set PCache2 = ActiveWorkbook.PivotCaches.Create(SourceType:=xlDatabase, SourceData:= _
    PRange).CreatePivotTable(TableDestination:= _
    PDest.Cells(3, 1), TableName:="Source2Pivot")

Answer 1

You can use duplicated on any data.frame:

DT$new_col1 <- duplicated(DT)    # whole row matches an earlier row
DT$new_col2 <- duplicated(DT$A)  # value of A matches an earlier row

DT
#     A B C new_col1 new_col2
#  1: 1 1 1    FALSE    FALSE
#  2: 1 1 2    FALSE     TRUE
#  3: 1 1 1     TRUE     TRUE
#  4: 1 2 2    FALSE     TRUE
#  5: 2 2 1    FALSE    FALSE
#  6: 2 2 2    FALSE     TRUE
#  7: 2 3 1    FALSE     TRUE
#  8: 2 3 2    FALSE     TRUE
#  9: 3 3 1    FALSE    FALSE
# 10: 3 4 2    FALSE     TRUE
# 11: 3 4 1    FALSE     TRUE
# 12: 3 4 2     TRUE     TRUE
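
For the two-column case described in the question, a minimal base R sketch might look like the following. The data frame and the column names `id`, `grp`, `dup_id`, and `dup_pair` are made up for illustration; they are not from the question.

```r
# Hypothetical two-column data frame (all names here are assumptions)
df <- data.frame(
  id  = c(1, 1, 2, 2, 3),
  grp = c("x", "x", "x", "y", "y")
)

# Test 1: TRUE when the value in `id` already appeared in an earlier row
df$dup_id <- duplicated(df$id)

# Test 2: TRUE when the (id, grp) combination already appeared in an earlier row
df$dup_pair <- duplicated(df[, c("id", "grp")])

print(df)
```

Note that duplicated flags only the second and later occurrences, so the first row of each group stays FALSE.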

When working with a data.table, you may prefer data.table syntax (thanks @Frank):

DT[,new_col1:= duplicated(.SD)][,new_col2:= duplicated(A)]

FYI, data.table has its own duplicated method, which can also be used as follows:

duplicated(DT, by="A")
# [1] FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

See ?data.table:::duplicated
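
The by argument also accepts several column names, which covers the "combination of columns" test. The sketch below assumes the data.table package is installed; the way DT is built is a guess that reproduces the A, B, C values printed earlier, and the column names `dup_A` and `dup_AB` are hypothetical.

```r
library(data.table)

# Assumed construction that matches the A, B, C values shown in the output above
DT <- data.table(A = rep(1:3, each = 4),
                 B = rep(1:4, each = 3),
                 C = rep(1:2, 6))

# Flag repeats of A alone, adding the column by reference (no copy of DT)
DT[, dup_A := duplicated(DT, by = "A")]

# Flag repeats of the (A, B) combination
DT[, dup_AB := duplicated(DT, by = c("A", "B"))]

print(DT)
```

Because := adds columns by reference, this avoids copying the table, which may matter with 20 million rows.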
