I have a data frame, and I am trying to remove the rows where the years are not consecutive.
Here is a sample of my data frame:
Name Year Position Year_diff FBv ind1 velo_diff
1 Aaron Heilman 2005 RP 2 90.1 TRUE 0.0
2 Aaron Heilman 2003 SP NA 89.4 NA 0.0
3 Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
4 Aaron Laffey 2009 SP NA 87.4 NA 0.0
5 Alexi Ogando 2015 RP 2 94.5 TRUE 0.0
6 Alexi Ogando 2013 SP NA 93.4 FALSE 0.0
7 Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
8 Alexi Ogando 2011 SP NA 95.1 NA 0.0
The expected output should be:
Name Year Position Year_diff FBv ind1 velo_diff
3 Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
4 Aaron Laffey 2009 SP NA 87.4 NA 0.0
7 Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
8 Alexi Ogando 2011 SP NA 95.1 NA 0.0
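Alexi Ogando's 2011-2012 rows remain because his SP to RP sequence meets the consecutive-year requirement. Ogando's 2013-2015 SP to RP sequence does not span consecutive years, so it is dropped.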
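One thing that may help: for every sequence where the years are not consecutive, velo_diff will be 0.0.
Does anyone know how to do this? All help is appreciated.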
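Answer 0 (score: 1)
You can do a grouped filter, checking whether the following or the previous year exists for that player and whether Position matches accordingly: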
library(dplyr)
df <- read.table(text = 'Name Year Position Year_diff FBv ind1 velo_diff
1 "Aaron Heilman" 2005 RP 2 90.1 TRUE 0.0
2 "Aaron Heilman" 2003 SP NA 89.4 NA 0.0
3 "Aaron Laffey" 2010 RP 1 86.8 TRUE -0.6
4 "Aaron Laffey" 2009 SP NA 87.4 NA 0.0
5 "Alexi Ogando" 2015 RP 2 94.5 TRUE 0.0
6 "Alexi Ogando" 2013 SP NA 93.4 FALSE 0.0
7 "Alexi Ogando" 2012 RP 1 97.0 TRUE 1.9
8 "Alexi Ogando" 2011 SP NA 95.1 NA 0.0', header = TRUE)
df %>% group_by(Name) %>%
filter(((Year - 1) %in% Year & Position == 'RP') |
((Year + 1) %in% Year & Position == 'SP'))
#> Source: local data frame [4 x 7]
#> Groups: Name [2]
#>
#> Name Year Position Year_diff FBv ind1 velo_diff
#> <fctr> <int> <fctr> <int> <dbl> <lgl> <dbl>
#> 1 Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
#> 2 Aaron Laffey 2009 SP NA 87.4 NA 0.0
#> 3 Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
#> 4 Alexi Ogando 2011 SP NA 95.1 NA 0.0
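To see why the grouped filter above keeps only the 2011-2012 rows for Alexi Ogando, here is a minimal base R sketch of the same condition evaluated on one player's rows. The vector names `year`, `position`, and `keep` are made up for illustration and are not part of the answer.

```r
# Alexi Ogando's rows from the sample data (illustrative vectors, not the full df)
year     <- c(2015, 2013, 2012, 2011)
position <- c("RP", "SP", "RP", "SP")

# An RP row is kept when the previous year is present for the same player;
# an SP row is kept when the following year is present.
keep <- ((year - 1) %in% year & position == "RP") |
        ((year + 1) %in% year & position == "SP")

keep        # FALSE FALSE  TRUE  TRUE
year[keep]  # 2012 2011
```

Note that the condition only checks that the adjacent year exists within the group. It does not check which Position the adjacent-year row has, which is enough for the sample data but may be too loose for other data.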
Answer 1 (score: 1)
We can use data.table (here df1 is the data frame from the question):
library(data.table)
setDT(df1)[df1[, .I[abs(diff(Year))==1], .(Name, grp = cumsum(Position == "RP"))]$V1]
# Name Year Position Year_diff FBv ind1 velo_diff
#1: Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
#2: Aaron Laffey 2009 SP NA 87.4 NA 0.0
#3: Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
#4: Alexi Ogando 2011 SP NA 95.1 NA 0.0
Or use the same approach with dplyr:
library(dplyr)
df1 %>%
group_by(Name, grp = cumsum(Position == "RP")) %>%
filter(abs(diff(Year))==1) %>% #below 2 steps may not be needed
ungroup() %>%
select(-grp)
# A tibble: 4 × 7
# Name Year Position Year_diff FBv ind1 velo_diff
# <chr> <int> <chr> <int> <dbl> <lgl> <dbl>
#1 Aaron Laffey 2010 RP 1 86.8 TRUE -0.6
#2 Aaron Laffey 2009 SP NA 87.4 NA 0.0
#3 Alexi Ogando 2012 RP 1 97.0 TRUE 1.9
#4 Alexi Ogando 2011 SP NA 95.1 NA 0.0
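Both snippets in this answer rely on `grp = cumsum(Position == "RP")` to pair each RP row with the SP row that follows it. A rough base R sketch of that grouping on the sample data is below. It assumes, as in the sample, that every RP row is immediately followed by its matching SP row; the names `pair_ok` and `keep` are illustrative only.

```r
# Year and Position columns from the sample data, in the original row order
year     <- c(2005, 2003, 2010, 2009, 2015, 2013, 2012, 2011)
position <- c("RP", "SP", "RP", "SP", "RP", "SP", "RP", "SP")

# The counter increases at every RP row, so each RP/SP pair shares one group id
grp <- cumsum(position == "RP")   # 1 1 2 2 3 3 4 4

# A pair qualifies when it has more than one row and its years are one apart
pair_ok <- vapply(split(year, grp),
                  function(y) length(y) > 1 && all(abs(diff(y)) == 1),
                  logical(1))

# Spread the per-pair result back onto the rows
keep <- unname(pair_ok[as.character(grp)])
year[keep]  # 2010 2009 2012 2011
```

The `length(y) > 1` guard is an addition in this sketch: without it, a lone row would pass, because `all()` of an empty vector is TRUE.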