Group lag/lead in R and dplyr

Time: 2016-06-23 14:00:36

Tags: r dplyr

I am having trouble lagging a date column when the data is grouped by team.

Data:

 df <- data.frame(Team = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D", "D"),
             Date = c("2016-05-10","2016-05-10", "2016-05-10", "2016-05-10",
                      "2016-05-12", "2016-05-12", "2016-05-12",
                      "2016-05-15","2016-05-15",
                      "2016-05-30", "2016-05-30"), 
             Points = c(1,4,3,2,1,5,6,1,2,3,9)
             )

Team      Date       Points
 A     2016-05-10      1
 A     2016-05-10      4
 A     2016-05-10      3
 A     2016-05-10      2
 B     2016-05-12      1
 B     2016-05-12      5
 B     2016-05-12      6
 C     2016-05-15      1
 C     2016-05-15      2
 D     2016-05-30      3
 D     2016-05-30      9

Expected result:

Team      Date       Points   Date_Lagged
 A     2016-05-10      1          NA
 A     2016-05-10      4          NA
 A     2016-05-10      3          NA
 A     2016-05-10      2          NA
 B     2016-05-12      1      2016-05-10 
 B     2016-05-12      5      2016-05-10 
 B     2016-05-12      6      2016-05-10 
 C     2016-05-15      1      2016-05-12
 C     2016-05-15      2      2016-05-12
 D     2016-05-30      3      2016-05-15
 D     2016-05-30      9      2016-05-15

I was left scratching my head after realizing that the following is not the correct solution:

df %>% group_by(Date) %>% mutate(Date_lagged = lag(Date))  

Any idea how to solve this?

2 answers:

Answer 0: (score: 7)

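lag shifts by n = 1 by default. Here, however, the 'Team' and 'Date' values are repeated across rows, so a plain lag just picks up the same date from the row above. To get the expected output, we need to take the distinct rows of 'Team' and 'Date', create 'Date_Lagged' by applying lag to 'Date', and then right_join (or left_join) the result with the original dataset.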

# Lag over one row per Team/Date, then join back onto the full data
distinct(df, Team, Date) %>%
        mutate(Date_Lagged = lag(Date)) %>%
        right_join(df, by = c("Team", "Date")) %>%
        select(Team, Date, Points, Date_Lagged)
#   Team       Date Points Date_Lagged
#1     A 2016-05-10      1        <NA>
#2     A 2016-05-10      4        <NA>
#3     A 2016-05-10      3        <NA>
#4     A 2016-05-10      2        <NA>
#5     B 2016-05-12      1  2016-05-10
#6     B 2016-05-12      5  2016-05-10
#7     B 2016-05-12      6  2016-05-10
#8     C 2016-05-15      1  2016-05-12
#9     C 2016-05-15      2  2016-05-12
#10    D 2016-05-30      3  2016-05-15
#11    D 2016-05-30      9  2016-05-15
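
As a side note on why the attempt in the question returns something else: with group_by(Date), lag is applied separately inside each date group, so it can never reach a different date. A minimal sketch on a made-up two-date subset (df_small and within_group are names invented for this illustration, and the columns are kept as character rather than factor):

```r
library(dplyr)

# Hypothetical stand-in for the question's data, reduced to two dates.
df_small <- data.frame(
  Team = c("A", "A", "B", "B", "B"),
  Date = c("2016-05-10", "2016-05-10", "2016-05-12", "2016-05-12", "2016-05-12"),
  stringsAsFactors = FALSE
)

# lag() runs within each Date group: the first row of every group becomes NA
# and the remaining rows repeat the group's own date.
within_group <- df_small %>%
  group_by(Date) %>%
  mutate(Date_Lagged = lag(Date)) %>%
  ungroup()

within_group$Date_Lagged
# [1] NA           "2016-05-10" NA           "2016-05-12" "2016-05-12"
```

That is why the lag has to be computed on one row per date first and only then spread back over the repeated rows.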

Alternatively, we can also do

df %>% 
    mutate(Date_Lagged = rep(lag(unique(Date)), table(Date)))
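
Note that the rep/table version relies on the rows for each date being contiguous and on table() listing the dates in the same order as unique() does (table() sorts its categories, unique() keeps order of appearance). If that cannot be guaranteed, one possible variant is to look up each row's date with match() instead. This is only a sketch: df2 and its values are made up for illustration, with the rows for the first date deliberately not adjacent.

```r
library(dplyr)

# Hypothetical data in which the rows for 2016-05-10 are not contiguous.
df2 <- data.frame(
  Team = c("A", "A", "B", "A", "C"),
  Date = c("2016-05-10", "2016-05-10", "2016-05-12", "2016-05-10", "2016-05-15"),
  stringsAsFactors = FALSE
)

# One entry per date, in order of first appearance.
dates <- unique(df2$Date)

# Lag the unique dates, then map every row to its date's lagged value.
df2$Date_Lagged <- dplyr::lag(dates)[match(df2$Date, dates)]

df2$Date_Lagged
# [1] NA           NA           "2016-05-10" NA           "2016-05-12"
```

Here "previous date" means the previous one in order of first appearance; sort dates first if chronological order is what is wanted.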

Answer 1: (score: 3)

You can also do this in base R, for example using rle:

with(rle(as.character(df$Date)), rep(c(NA, head(values, -1)), lengths))
# [1] NA           NA           NA           NA           "2016-05-10" "2016-05-10"
# [7] "2016-05-10" "2016-05-12" "2016-05-12" "2016-05-15" "2016-05-15"
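
The call above returns a character vector. If the result should live in the data frame as a proper Date column, something along these lines may work. It is a sketch that rebuilds the question's data with character columns; runs and prev are helper names introduced here:

```r
# The question's data, rebuilt compactly with character columns.
df <- data.frame(
  Team   = rep(c("A", "B", "C", "D"), times = c(4, 3, 2, 2)),
  Date   = rep(c("2016-05-10", "2016-05-12", "2016-05-15", "2016-05-30"),
               times = c(4, 3, 2, 2)),
  Points = c(1, 4, 3, 2, 1, 5, 6, 1, 2, 3, 9),
  stringsAsFactors = FALSE
)

# Run-length encode the dates: one value and one length per consecutive run.
runs <- rle(as.character(df$Date))

# Shift the run values down by one, so the first run has no predecessor.
prev <- c(NA, head(runs$values, -1))

# Expand back to one value per row and convert to Date.
df$Date_Lagged <- as.Date(rep(prev, runs$lengths))
```

Because rle works on consecutive runs, this assumes the rows for each date are adjacent, as they are in the question's data.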