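Select rows where the change in a column exceeds a threshold

Time: 2018-01-22 09:50:46

Tags: r dplyr

I have a data frame with three columns: the first is an ID, the second is a year, and the third is the value associated with that ID in that year: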

df.in <- data.frame("id"=c(1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3,3),
                    "yr"=c(2005,2006,2007,2008,2010, 2001,2002,2003,2006,2008,2009, 2001, 2002,2003,2004,2005,2007,2009),
                    "val"=c(5,6,7,8,10, 1,2,3,6,8,10, 1,2,3,4,5,7,9))

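I want to remove the rows where the year is more than 1 greater than the year in the previous row. In other words, I want to keep only those rows in which the years follow each other in increments of 1: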

df.out <- data.frame("id"=c(1,1,1,1, 2,2,2, 3,3,3,3,3),
                     "yr"=c(2005,2006,2007,2008, 2001,2002,2003,2001, 2002,2003,2004,2005),
                     "val"=c(5,6,7,8, 1,2,3, 1,2,3,4,5))

Is there a way to do this in R using dplyr? If possible, I would also like a data frame containing all the discarded years:

df.discard <- data.frame("id"=c(1, 2,2,2, 3,3),
                         "yr"=c(2010, 2006, 2008,2009, 2007,2009),
                         "val"=c(10, 6, 8,10, 7,9))

1 answer:

Answer 0 (score: 3)

Use lag.

Filter according to your rule:
df.in %>% filter(val - lag(val) > 1)
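
As a small illustration of why the grouping matters here, the sketch below uses a hypothetical toy data frame `toy` (not taken from the question) to show what `lag()` returns with and without `group_by(id)`. Without grouping, the first row of one id is compared with the last row of the previous id.

```r
library(dplyr)

# Hypothetical toy data, not taken from the question
toy <- data.frame(id = c(1, 1, 2, 2),
                  yr = c(2005, 2006, 2001, 2002))

# Ungrouped: the first row of id 2 is paired with the last row of id 1
ungrouped <- toy %>% mutate(prev = lag(yr))

# Grouped: lag() restarts within each id, so each first row has no predecessor
grouped <- toy %>% group_by(id) %>% mutate(prev = lag(yr)) %>% ungroup()

ungrouped$prev  # NA 2005 2006 2001
grouped$prev    # NA 2005 NA 2001
```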

Following the comments from @Sotos and @akrun, I changed the code to use yr instead of val:

df.in <- data.frame("id"=c(1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3,3),
                    "yr"=c(2005,2006,2007,2008,2010, 2001,2002,2003,2006,2008,2010, 2001, 2002,2003,2004,2005,2007,2009),
                    "val"=c(5,6,7,8,10, 1,2,3,6,8,10, 1,2,3,4,5,7,9))

df.out <- data.frame("id"=c(1,1,1,1, 2,2,2, 3,3,3,3,3),
                     "yr"=c(2005,2006,2007,2008, 2001,2002,2003,2001, 2002,2003,2004,2005),
                     "val"=c(5,6,7,8, 1,2,3, 1,2,3,4,5))


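library(dplyr)

# output: within each id, keep rows whose year is at most 1 greater than the previous row's year

df.out <- df.in %>% group_by(id) %>% filter((yr - lag(yr, default = yr[1]) <= 1))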

df.out

# ignored: within each id, rows whose year is more than 1 greater than the previous row's year

df.ignored <- df.in %>% group_by(id) %>% filter((yr - lag(yr, default = yr[1]) > 1))

df.ignored
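
The `default = yr[1]` argument is what keeps the first row of each id. The sketch below, on a hypothetical toy data frame `toy` (not from the question), compares the filter with and without it: without a default, the first difference in each group is `NA`, and `filter()` drops rows whose condition is `NA`.

```r
library(dplyr)

# Hypothetical toy data, not taken from the question
toy <- data.frame(id = c(1, 1, 1, 2, 2),
                  yr = c(2005, 2006, 2008, 2001, 2002))

# No default: the first difference per id is NA, so the first row is dropped
no_default <- toy %>%
  group_by(id) %>%
  filter(yr - lag(yr) <= 1) %>%
  ungroup()

# default = yr[1]: the first difference per id is 0, so the first row is kept
with_default <- toy %>%
  group_by(id) %>%
  filter(yr - lag(yr, default = yr[1]) <= 1) %>%
  ungroup()

no_default$yr    # 2006 2002
with_default$yr  # 2005 2006 2001 2002
```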

Output of df.out and df.ignored:

> df.out
# A tibble: 12 x 3
# Groups: id [3]
      id    yr   val
   <dbl> <dbl> <dbl>
 1  1.00  2005  5.00
 2  1.00  2006  6.00
 3  1.00  2007  7.00
 4  1.00  2008  8.00
 5  2.00  2001  1.00
 6  2.00  2002  2.00
 7  2.00  2003  3.00
 8  3.00  2001  1.00
 9  3.00  2002  2.00
10  3.00  2003  3.00
11  3.00  2004  4.00
12  3.00  2005  5.00
> df.ignored
# A tibble: 6 x 3
# Groups: id [3]
     id    yr   val
  <dbl> <dbl> <dbl>
1  1.00  2010 10.0 
2  2.00  2006  6.00
3  2.00  2008  8.00
4  2.00  2010 10.0 
5  3.00  2007  7.00
6  3.00  2009  9.00
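
One caveat: the filter above compares each row only with its immediate predecessor, so a row that comes exactly one year after a discarded row is kept. With the question's original data (id 2 ends in 2008, 2009 rather than 2008, 2010), 2009 would be kept, although the question lists it in df.discard. A possible variant, sketched here under the assumption that rows are already sorted by yr within each id, discards everything from the first gap onward by using `cumsum()`. The helper columns `gap` and `keep` are names introduced only for this illustration.

```r
library(dplyr)

# Data as given in the question (id 2 ends in 2008, 2009)
df.in <- data.frame(id  = c(1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3,3),
                    yr  = c(2005,2006,2007,2008,2010, 2001,2002,2003,2006,2008,2009,
                            2001,2002,2003,2004,2005,2007,2009),
                    val = c(5,6,7,8,10, 1,2,3,6,8,10, 1,2,3,4,5,7,9))

# Flag each row: keep is TRUE until the first gap larger than 1 within an id
flagged <- df.in %>%
  group_by(id) %>%
  mutate(gap  = yr - lag(yr, default = first(yr)),
         keep = cumsum(gap > 1) == 0) %>%
  ungroup()

# Split into kept and discarded rows, dropping the helper columns
df.out     <- flagged %>% filter(keep)  %>% select(id, yr, val)
df.discard <- flagged %>% filter(!keep) %>% select(id, yr, val)

df.out$yr      # 2005 2006 2007 2008 2001 2002 2003 2001 2002 2003 2004 2005
df.discard$yr  # 2010 2006 2008 2009 2007 2009
```

Computing the flag once and splitting on it also yields both data frames from a single pass, rather than repeating the grouped filter with the opposite condition.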