I am working with the dataset "Final.Export", which looks like this:
LakeID LakeName SourceVariableName SourceVariableDescription SourceFlags
47 390 Moosehead Acolor(PCU) Apparent color <NA>
48 390 Moosehead Acolor(PCU) Apparent color <NA>
49 390 Moosehead Acolor(PCU) Apparent color <NA>
50 390 Moosehead Acolor(PCU) Apparent color <NA>
51 390 Moosehead Acolor(PCU) Apparent color <NA>
52 390 Moosehead Acolor(PCU) Apparent color <NA>
53 390 Moosehead Acolor(PCU) Apparent color <NA>
54 390 Moosehead Acolor(PCU) Apparent color <NA>
55 390 Moosehead Acolor(PCU) Apparent color <NA>
56 390 Moosehead Acolor(PCU) Apparent color <NA>
LagosVariableID LagosVariableName Value Units CensorCode DetectionLimit Date
47 11 Color, apparent 22 PCU NC NA 2003-08-26
48 11 Color, apparent 17 PCU NC NA 2003-08-26
49 11 Color, apparent 16 PCU NC NA 2003-08-26
50 11 Color, apparent 14 PCU NC NA 2003-08-26
51 11 Color, apparent 14 PCU NC NA 2003-08-26
52 11 Color, apparent 17 PCU NC NA 2003-08-26
53 11 Color, apparent 16 PCU NC NA 2003-08-26
54 11 Color, apparent 17 PCU NC NA 2003-08-26
55 11 Color, apparent 14 PCU NC NA 2003-08-26
56 11 Color, apparent 17 PCU NC NA 2003-08-26
LabMethodName LabMethodInfo SampleType SamplePosition SampleDepth MethodInfo
47 <NA> <NA> INTEGRATED SPECIFIED 6 <NA>
48 <NA> <NA> INTEGRATED SPECIFIED 7 <NA>
49 <NA> <NA> INTEGRATED SPECIFIED 6 <NA>
50 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
51 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
52 <NA> <NA> INTEGRATED SPECIFIED 9 <NA>
53 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
54 <NA> <NA> INTEGRATED SPECIFIED 8 <NA>
55 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
56 <NA> <NA> INTEGRATED SPECIFIED 10 <NA>
BasinType Subprogram Comments Dup
47 UNKNOWN NA NA NA
48 UNKNOWN NA NA NA
49 UNKNOWN NA NA NA
50 UNKNOWN NA NA NA
51 UNKNOWN NA NA NA
52 UNKNOWN NA NA NA
53 UNKNOWN NA NA NA
54 UNKNOWN NA NA NA
55 UNKNOWN NA NA NA
56 UNKNOWN NA NA NA
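I want to flag all duplicate values with 1. A duplicate is defined as a row that has exactly the same values in each of the columns 'LakeID', 'Date', 'LagosVariableID', 'SampleDepth', and 'SamplePosition'.
To do this, I created a new data table, "data1", with the following code: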
library(data.table)
data1=data.table(Final.Export,key=c('LakeID','Date','LagosVariableID','SampleDepth','SamplePosition','Value'))
data1=data1[,Dup:=duplicated(.SD),.SDcols=c('LakeID','Date', 'LagosVariableID', 'SampleDepth', 'SamplePosition','Value')]
data1$Dup[which(data1$Dup==FALSE)]=NA
data1$Dup[which(data1$Dup==TRUE)]=1
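The problem with "data1" is that only the repeated rows (by my definition of a duplicate) that come after the first unique row are flagged with "1"; the first row itself is flagged NA. I need both the unique row and its associated duplicate rows flagged with "1". Any ideas on how to do this?
If this is confusing, please let me know how I can clarify.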
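To make the behaviour described above concrete, here is a minimal sketch on a made-up character vector (not the real data). It only illustrates that `duplicated()` returns FALSE for the first occurrence of a value and TRUE for later repeats, which is why the first row of each group ends up as NA.

```r
# Made-up values, purely for illustration
x <- c("a", "a", "b", "a")

# duplicated() flags only the 2nd and later occurrences of a value
first_pass <- duplicated(x)
print(first_pass)
```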
Answer 0 (score: 1)
It is hard to say without a reproducible example, but it seems you want something like this:
data1[,dup:=duplicated(.SD),
by=list(LakeID, LagosVariableID, Value, Date, SamplePosition, SampleDepth)]
Edit
After the OP's clarification, it seems they simply want this:
data1[,dup:=duplicated(.SD),
.SDcols=c('LakeID', 'Date', 'LagosVariableID', 'SampleDepth', 'SamplePosition')]
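Note that `duplicated()` on its own still leaves the first row of each group unflagged. If the goal is to flag every member of a duplicate group, including the first, one possible approach is sketched below. This is not taken from the answer above. The table `toy` and its values are invented stand-ins for Final.Export, and `key_cols` is a hypothetical helper name. The sketch shows two common idioms: counting rows per group with `.N`, and combining `duplicated()` with `fromLast = TRUE`.

```r
library(data.table)

# Toy stand-in for Final.Export; the values are invented for illustration
toy <- data.table(
  LakeID          = c(390, 390, 390, 390),
  Date            = as.Date(c("2003-08-26", "2003-08-26", "2003-08-26", "2003-08-27")),
  LagosVariableID = c(11, 11, 11, 11),
  SampleDepth     = c(6, 6, 7, 6),
  SamplePosition  = c("SPECIFIED", "SPECIFIED", "SPECIFIED", "SPECIFIED"),
  Value           = c(22, 16, 17, 14)
)

# Columns that define a duplicate (hypothetical helper name)
key_cols <- c("LakeID", "Date", "LagosVariableID", "SampleDepth", "SamplePosition")

# Idiom 1: any group with more than one row is a duplicate group,
# so every row in it gets 1; rows in singleton groups get NA
toy[, Dup := if (.N > 1L) 1 else NA_real_, by = key_cols]

# Idiom 2: a row belongs to a duplicate group if it repeats an earlier row
# or is repeated by a later one (scan from both ends)
in_dup_group <- duplicated(toy, by = key_cols) |
  duplicated(toy, by = key_cols, fromLast = TRUE)

print(toy)
print(in_dup_group)
```

In this toy table the first two rows share all five key columns, so both are flagged. The third row differs in SampleDepth and the fourth in Date, so they stay NA. Both idioms should agree; the `.N` version avoids scanning the table twice.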