In R

Date: 2015-04-22 12:24:37

Tags: r subset

This is a subset of my dataset, in which the variable elevated was measured in several experiments:

        Experiment.Name Sampling.Year   elevated
3409       Swiss Jura_c          1999   17.30000
3410       Swiss Jura_c          1999    9.10000
3411 SwissFACE_lolium_c          2000   -1.45545
3412 SwissFACE_lolium_c          2000   -2.94843
3413 SwissFACE_lolium_c          2000   -3.74132
3414 SwissFACE_lolium_c          2000   -1.42080
3461              DRI_c          1993  122.87900
3462              DRI_c          1993   13.71500
3463              DRI_c          1993    0.91800
3464              DRI_c          1993    1.29800
3465              DRI_c          1993    2.43600
3466              DRI_c          1993    3.46600
3467              DRI_c          1994    0.42700
3469              DRI_c          1994    1.74100
3470              DRI_c          1994    1.01700
3471              DRI_c          1994    2.38300
3640 Bonanza Creek_pb_f          2001 3222.00000
3641 Bonanza Creek_pg_f          2001 3455.00000
3665    Fork Mountain_f          2000    0.24900
3669    Fork Mountain_f          2001    0.23100
4037            KFFL_wh          2003   42.07000

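I want to subset the whole dataset so that I keep only the experiments in which elevated was measured in more than one year. For example, in the table above I would exclude the rows for the Swiss Jura_c experiment, because it has measurements from only one year: 1999. I would, however, keep the rows for the DRI_c experiment, because it has measurements from more than one year: 1993 and 1994. How can I do this kind of subsetting in R? Thanks.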

2 answers:

Answer 0 (score: 3)

Try

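library(data.table)
# setDT() converts df1 to a data.table by reference; within each experiment,
# keep all rows only if there is more than one distinct sampling year
setDT(df1)[, .SD[uniqueN(Sampling.Year) > 1], by = Experiment.Name]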

Or

library(dplyr)
df1 %>%
    group_by(Experiment.Name) %>%
    filter(n_distinct(Sampling.Year) > 1)

Data

df1 <- structure(list(Experiment.Name = c("Swiss Jura_c",
"Swiss Jura_c", 
"SwissFACE_lolium_c", "SwissFACE_lolium_c", "SwissFACE_lolium_c", 
"SwissFACE_lolium_c", "DRI_c", "DRI_c", "DRI_c", "DRI_c", "DRI_c", 
"DRI_c", "DRI_c", "DRI_c", "DRI_c", "DRI_c", "Bonanza Creek_pb_f", 
"Bonanza Creek_pg_f", "Fork Mountain_f", "Fork Mountain_f", "KFFL_wh"
), Sampling.Year = c(1999L, 1999L, 2000L, 2000L, 2000L, 2000L, 
1993L, 1993L, 1993L, 1993L, 1993L, 1993L, 1994L, 1994L, 1994L, 
1994L, 2001L, 2001L, 2000L, 2001L, 2003L), elevated = c(17.3, 
9.1, -1.45545, -2.94843, -3.74132, -1.4208, 122.879, 13.715, 
0.918, 1.298, 2.436, 3.466, 0.427, 1.741, 1.017, 2.383, 3222, 
3455, 0.249, 0.231, 42.07)), .Names = c("Experiment.Name", 
"Sampling.Year", 
"elevated"), row.names = c(3409L, 3410L, 3411L, 3412L, 3413L, 
3414L, 3461L, 3462L, 3463L, 3464L, 3465L, 3466L, 3467L, 3469L, 
3470L, 3471L, 3640L, 3641L, 3665L, 3669L, 4037L), class = "data.frame")
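
As a quick sanity check on which experiments should survive the filter, the number of distinct years per experiment can be tabulated in base R. This is only a sketch: it rebuilds just the two grouping columns from the data above, and the names `chk`, `years_per_exp` and `multi_year` are illustrative rather than part of the original answer.

```r
# Rebuild only the two grouping columns of df1 (values copied from the data above).
chk <- data.frame(
  Experiment.Name = rep(
    c("Swiss Jura_c", "SwissFACE_lolium_c", "DRI_c", "Bonanza Creek_pb_f",
      "Bonanza Creek_pg_f", "Fork Mountain_f", "KFFL_wh"),
    times = c(2, 4, 10, 1, 1, 2, 1)
  ),
  Sampling.Year = c(rep(1999L, 2), rep(2000L, 4), rep(1993L, 6), rep(1994L, 4),
                    2001L, 2001L, 2000L, 2001L, 2003L),
  stringsAsFactors = FALSE
)

# Number of distinct sampling years per experiment.
years_per_exp <- tapply(chk$Sampling.Year, chk$Experiment.Name,
                        function(y) length(unique(y)))

# Experiments measured in more than one year.
multi_year <- names(years_per_exp)[years_per_exp > 1]
print(multi_year)
```

For the sample data only DRI_c and Fork Mountain_f have more than one distinct year, so those are the experiments whose rows the filters should return.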

Answer 1 (score: 1)

Or, using base R:

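# Count the distinct sampling years for each experiment
a <- aggregate(Sampling.Year ~ Experiment.Name, data = df1, FUN = function(x) length(unique(x)))
# Keep only the rows of experiments measured in more than one year
df1[df1$Experiment.Name %in% a$Experiment.Name[a$Sampling.Year > 1], ]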
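
A variant of the same base R idea, sketched here with `ave()`, skips the intermediate aggregate table by computing the per-experiment count of distinct years directly as a row-aligned vector. The data frame `toy` is a hypothetical cut-down sample (a few rows taken from the question's table), and `n_years` and `kept` are illustrative names.

```r
# Small sample taken from the question's table (name `toy` is illustrative).
toy <- data.frame(
  Experiment.Name = c("Swiss Jura_c", "Swiss Jura_c", "DRI_c", "DRI_c",
                      "Fork Mountain_f", "Fork Mountain_f", "KFFL_wh"),
  Sampling.Year = c(1999L, 1999L, 1993L, 1994L, 2000L, 2001L, 2003L),
  elevated = c(17.3, 9.1, 122.879, 0.427, 0.249, 0.231, 42.07),
  stringsAsFactors = FALSE
)

# For every row, the number of distinct years within that row's experiment.
n_years <- ave(toy$Sampling.Year, toy$Experiment.Name,
               FUN = function(y) length(unique(y)))

# Keep rows belonging to experiments sampled in more than one year.
kept <- toy[n_years > 1, ]
print(kept)
```

Because `ave()` returns one value per row, the result can be used directly as a logical row index without any `%in%` lookup.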