Removing rows where the data is not consecutive in R (dplyr)

Time: 2017-04-24 04:54:17

Tags: r dplyr

I have a data frame and I am trying to remove the rows where the years are not consecutive.

Here is a sample of my data frame:

         Name       Year Position Year_diff  FBv     ind1  velo_diff
1     Aaron Heilman 2005       RP         2  90.1    TRUE      0.0
2     Aaron Heilman 2003       SP         NA 89.4      NA      0.0 
3     Aaron Laffey  2010       RP         1  86.8    TRUE     -0.6 
4     Aaron Laffey  2009       SP         NA 87.4      NA      0.0
5     Alexi Ogando  2015       RP         2  94.5    TRUE      0.0
6     Alexi Ogando  2013       SP         NA 93.4   FALSE      0.0
7     Alexi Ogando  2012       RP         1  97.0    TRUE      1.9
8     Alexi Ogando  2011       SP         NA 95.1      NA      0.0

The expected output should be:

          Name      Year  Position Year_diff  FBv    ind1   velo_diff
3     Aaron Laffey  2010       RP         1   86.8    TRUE    -0.6
4     Aaron Laffey  2009       SP         NA  87.4      NA     0.0
7     Alexi Ogando  2012       RP         1   97.0    TRUE     1.9
8     Alexi Ogando  2011       SP         NA  95.1      NA     0.0

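Alexi Ogando's 2011-2012 rows remain because his SP-to-RP sequence meets the consecutive-years requirement. Ogando's 2013-2015 SP-to-RP sequence does not fall in consecutive years, so it is removed.

One detail that may help: for every sequence where the years are not consecutive, velo_diff is 0.0.

Does anyone know how to do this? All help is appreciated.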

2 answers:

Answer 0: (score: 1)

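You can do a grouped filter that checks, within each Name, whether the previous year is present for an RP row or the following year is present for an SP row: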

library(dplyr)

df <- read.table(text = 'Name       Year Position Year_diff  FBv     ind1  velo_diff
1     "Aaron Heilman" 2005       RP         2  90.1    TRUE      0.0
2     "Aaron Heilman" 2003       SP         NA 89.4      NA      0.0 
3     "Aaron Laffey"  2010       RP         1  86.8    TRUE     -0.6 
4     "Aaron Laffey"  2009       SP         NA 87.4      NA      0.0
5     "Alexi Ogando"  2015       RP         2  94.5    TRUE      0.0
6     "Alexi Ogando"  2013       SP         NA 93.4   FALSE      0.0
7     "Alexi Ogando"  2012       RP         1  97.0    TRUE      1.9
8     "Alexi Ogando"  2011       SP         NA 95.1      NA      0.0', header = TRUE)

df %>% group_by(Name) %>% 
    filter(((Year - 1) %in% Year & Position == 'RP') | 
           ((Year + 1) %in% Year & Position == 'SP'))

#> Source: local data frame [4 x 7]
#> Groups: Name [2]
#> 
#>           Name  Year Position Year_diff   FBv  ind1 velo_diff
#>         <fctr> <int>   <fctr>     <int> <dbl> <lgl>     <dbl>
#> 1 Aaron Laffey  2010       RP         1  86.8  TRUE      -0.6
#> 2 Aaron Laffey  2009       SP        NA  87.4    NA       0.0
#> 3 Alexi Ogando  2012       RP         1  97.0  TRUE       1.9
#> 4 Alexi Ogando  2011       SP        NA  95.1    NA       0.0
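
To see why each row is kept or dropped, the two halves of the filter condition can be written out as columns first. This is only a sketch: the column names `prev_year_present`, `next_year_present` and `keep` are hypothetical helpers, and the data frame is rebuilt from the question's sample with just the three columns the filter uses.

```r
library(dplyr)

# The question's sample, reduced to the three columns the filter uses
df <- data.frame(
  Name = rep(c("Aaron Heilman", "Aaron Laffey", "Alexi Ogando"), times = c(2, 2, 4)),
  Year = c(2005L, 2003L, 2010L, 2009L, 2015L, 2013L, 2012L, 2011L),
  Position = c("RP", "SP", "RP", "SP", "RP", "SP", "RP", "SP"),
  stringsAsFactors = FALSE
)

flags <- df %>%
  group_by(Name) %>%
  mutate(
    # hypothetical helper columns, added only to inspect the logic
    prev_year_present = (Year - 1) %in% Year,  # does this player have a row for Year - 1?
    next_year_present = (Year + 1) %in% Year,  # does this player have a row for Year + 1?
    keep = (prev_year_present & Position == "RP") |
           (next_year_present & Position == "SP")
  ) %>%
  ungroup()

print(flags)
```

Inside a grouped `mutate()` or `filter()`, `Year` refers only to the current player's years, so `%in%` never matches a year belonging to someone else. Note that the check looks for the neighbouring year anywhere in the player's rows, not specifically in the adjacent row.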

Answer 1: (score: 1)

We can use data.table:

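library(data.table)
# df1 is assumed to hold the question's data frame; grp starts a new block at every RP row.
# .I collects the row numbers of the blocks whose two years differ by exactly 1.
setDT(df1)[df1[, .I[abs(diff(Year)) == 1], .(Name, grp = cumsum(Position == "RP"))]$V1]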
#           Name Year Position Year_diff  FBv ind1 velo_diff
#1: Aaron Laffey 2010       RP         1 86.8 TRUE      -0.6
#2: Aaron Laffey 2009       SP        NA 87.4   NA       0.0
#3: Alexi Ogando 2012       RP         1 97.0 TRUE       1.9
#4: Alexi Ogando 2011       SP        NA 95.1   NA       0.0
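
The grouping trick may be easier to follow in base R. The sketch below assumes, as in the sample, that every RP row is immediately followed by its SP row, so each block has exactly two rows. The names `consecutive` and `keep_row` are made up for illustration, and `Name` is left out because the running count already separates the players here.

```r
# Columns taken from the question's sample, in the original row order
Position <- c("RP", "SP", "RP", "SP", "RP", "SP", "RP", "SP")
Year <- c(2005L, 2003L, 2010L, 2009L, 2015L, 2013L, 2012L, 2011L)

# Each RP row bumps the running count, so an RP row and the SP row after it share an id
grp <- cumsum(Position == "RP")

# One TRUE/FALSE per block: do its two years differ by exactly 1?
consecutive <- vapply(split(Year, grp), function(y) abs(diff(y)) == 1, logical(1))

# Expand the per-block answer back to one value per row
keep_row <- consecutive[as.character(grp)]

print(grp)
print(Year[keep_row])
```

`diff()` returns one value fewer than its input, which is why a two-row block yields a single TRUE or FALSE that can be applied to both rows.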

Or use the same approach with dplyr:

library(dplyr)
df1 %>%
   group_by(Name, grp = cumsum(Position == "RP")) %>%  
   filter(abs(diff(Year)) == 1) %>%  # the next two steps only drop the helper column and may be skipped
   ungroup() %>%
   select(-grp)
# A tibble: 4 × 7
#           Name  Year Position Year_diff   FBv  ind1 velo_diff
#          <chr> <int>    <chr>     <int> <dbl> <lgl>     <dbl>
#1 Aaron Laffey  2010       RP         1  86.8  TRUE      -0.6
#2 Aaron Laffey  2009       SP        NA  87.4    NA       0.0
#3 Alexi Ogando  2012       RP         1  97.0  TRUE       1.9
#4 Alexi Ogando  2011       SP        NA  95.1    NA       0.0
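
The `diff(Year)` condition lines up with the group only when every block has exactly two rows, because `diff()` returns one value fewer than the group size. If the real data might contain an RP row with no SP partner, a more defensive variant could look like the sketch below. It assumes incomplete blocks should simply be dropped, and the ninth row ("Solo Player") is invented purely to exercise that case.

```r
library(dplyr)

# Question's sample plus one invented unpaired row (hypothetical, for illustration only)
df1 <- data.frame(
  Name = rep(c("Aaron Heilman", "Aaron Laffey", "Alexi Ogando", "Solo Player"),
             times = c(2, 2, 4, 1)),
  Year = c(2005L, 2003L, 2010L, 2009L, 2015L, 2013L, 2012L, 2011L, 2014L),
  Position = c("RP", "SP", "RP", "SP", "RP", "SP", "RP", "SP", "RP"),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  group_by(Name, grp = cumsum(Position == "RP")) %>%
  # keep only complete two-row blocks whose years differ by exactly 1
  filter(n() == 2, all(abs(diff(Year)) == 1)) %>%
  ungroup() %>%
  select(-grp)

print(result)
```

Both conditions evaluate to a single value per group, so they are recycled across the block's rows regardless of its size.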