根据规范标记重复(非唯一)值

时间:2013-10-04 16:41:31

标签: r data.table

enter code here我正在处理数据集“Final.Export”,如下所示:

            LakeID   LakeName SourceVariableName SourceVariableDescription SourceFlags
47    390 Moosehead         Acolor(PCU)            Apparent color        <NA>
48    390 Moosehead         Acolor(PCU)            Apparent color        <NA>
49    390 Moosehead         Acolor(PCU)            Apparent color        <NA>
50    390 Moosehead         Acolor(PCU)            Apparent color        <NA>
51    390 Moosehead         Acolor(PCU)            Apparent color        <NA>
52    390 Moosehead         Acolor(PCU)            Apparent color        <NA>
53    390 Moosehead         Acolor(PCU)            Apparent color        <NA>
54    390 Moosehead         Acolor(PCU)            Apparent color        <NA>
55    390 Moosehead         Acolor(PCU)            Apparent color        <NA>
56    390 Moosehead         Acolor(PCU)            Apparent color        <NA>
   LagosVariableID LagosVariableName Value Units CensorCode DetectionLimit       Date
47              11   Color, apparent    22   PCU         NC             NA 2003-08-26
48              11   Color, apparent    17   PCU         NC             NA 2003-08-26
49              11   Color, apparent    16   PCU         NC             NA 2003-08-26
50              11   Color, apparent    14   PCU         NC             NA 2003-08-26
51              11   Color, apparent    14   PCU         NC             NA 2003-08-26
52              11   Color, apparent    17   PCU         NC             NA 2003-08-26
53              11   Color, apparent    16   PCU         NC             NA 2003-08-26
54              11   Color, apparent    17   PCU         NC             NA 2003-08-26
55              11   Color, apparent    14   PCU         NC             NA 2003-08-26
56              11   Color, apparent    17   PCU         NC             NA 2003-08-26
   LabMethodName LabMethodInfo SampleType SamplePosition SampleDepth MethodInfo
47          <NA>          <NA> INTEGRATED      SPECIFIED           6       <NA>
48          <NA>          <NA> INTEGRATED      SPECIFIED           7       <NA>
49          <NA>          <NA> INTEGRATED      SPECIFIED           6       <NA>
50          <NA>          <NA> INTEGRATED      SPECIFIED          10       <NA>
51          <NA>          <NA> INTEGRATED      SPECIFIED          10       <NA>
52          <NA>          <NA> INTEGRATED      SPECIFIED           9       <NA>
53          <NA>          <NA> INTEGRATED      SPECIFIED          10       <NA>
54          <NA>          <NA> INTEGRATED      SPECIFIED           8       <NA>
55          <NA>          <NA> INTEGRATED      SPECIFIED          10       <NA>
56          <NA>          <NA> INTEGRATED      SPECIFIED          10       <NA>
   BasinType Subprogram Comments Dup
47   UNKNOWN         NA       NA  NA
48   UNKNOWN         NA       NA  NA
49   UNKNOWN         NA       NA  NA
50   UNKNOWN         NA       NA  NA
51   UNKNOWN         NA       NA  NA
52   UNKNOWN         NA       NA  NA
53   UNKNOWN         NA       NA  NA
54   UNKNOWN         NA       NA  NA
55   UNKNOWN         NA       NA  NA
56   UNKNOWN         NA       NA  NA

我想将所有重复值标记为1.重复值定义为在'LakeID','Date','LagosVariableID','SampleDepth'和'SamplePosition'列的每一列中具有完全相同值的值

为此,我使用以下代码创建了一个新的数据表“data1”:

library(data.table)
data1=data.table(Final.Export,key=c('LakeID','Date','LagosVariableID','SampleDepth','SamplePosition','Value'))
data1=data1[,Dup:=duplicated(.SD),.SDcols=c('LakeID','Date', 'LagosVariableID', 'SampleDepth', 'SamplePosition','Value')]
data1$Dup[which(data1$Dup==FALSE)]=NA
data1$Dup[which(data1$Dup==TRUE)]=1

“data1”的问题是,在第一个唯一行(标记为NA)之后,只有重复的行(根据我的重复定义)被标记为“1”。我需要将唯一行和关联的重复行标记为“1”。任何想法如何做到这一点?

如果这令人困惑,请告诉我如何澄清。

1 个答案:

答案 0 :(得分:1)

如果没有可重复的例子,很难说,但似乎你想要这样的东西:

data1[,dup:=duplicated(.SD), 
       by=list(LakeID, LagosVariableID, Value, Date, SamplePosition, SampleDepth)]

修改

在OP澄清后,他们似乎只是想要这个:

data1[,dup:=duplicated(.SD), 
 .SDcols=c('LakeID', 'Date', 'LagosVariableID', 'SampleDepth', 'SamplePosition')]