Question

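I have searched high and low for a solution, but I can't find one...

My data frame (essentially a table of the No. 1 sports team by date) has many occasions where one or more teams "reappear" in the data. I want to pull out the start (or end) date of each period that a team spends at No. 1.

An example of the data could be: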

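library(lubridate)  # provides year()

# NOTE: teams1 was not defined in the original post. The definition below is an
# assumption reconstructed from the ddply output further down ("c" from
# 2014-01-01, "k" from 2014-01-09); the length of the "k" run is a guess.
teams1 <- c(rep("c", 8), rep("k", 8))
x1 <- as.Date("2013-12-31")
adddate1 <- seq_along(teams1)
dates1 <- x1 + adddate1
teams2 <- c(rep("w", 3), rep("c", 8), rep("w", 4))
x2 <- as.Date("2012-12-31")
adddate2 <- seq_along(teams2)
dates2 <- x2 + adddate2
dates <- c(dates2, dates1)
teams <- c(teams2, teams1)
df <- data.frame(dates, teams)
df$year <- year(df$dates)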

The rows for 2013 look like this:

        dates teams year
1  2013-01-01     w 2013
2  2013-01-02     w 2013
3  2013-01-03     w 2013
4  2013-01-04     c 2013
5  2013-01-05     c 2013
6  2013-01-06     c 2013
7  2013-01-07     c 2013
8  2013-01-08     c 2013
9  2013-01-09     c 2013
10 2013-01-10     c 2013
11 2013-01-11     c 2013
12 2013-01-12     w 2013
13 2013-01-13     w 2013
14 2013-01-14     w 2013
15 2013-01-15     w 2013

However, using ddply lumps together teams with the same name and returns the following:

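library(plyr)  # provides ddply() and .()
# First row for each year/team combination, then sort by date
split <- ddply(df, .(year, teams), head, 1)
split <- split[order(split[, 1]), ]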

       dates teams year
2 2013-01-01     w 2013
1 2013-01-04     c 2013
3 2014-01-01     c 2014
4 2014-01-09     k 2014

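Is there a more elegant way to do this than writing a function that goes through the original df and returns a unique value for each subset, adding that value to the df, and then using ddply with the new unique value to return what I want?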

Answer 1

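Since you say that some teams "reappear", I think the small intergroup helper function from this answer might be the right tool here. It is useful when, as in your case, a team, e.g. "w", reappears within the same year, e.g. 2013, after another team, e.g. "c", has been there for a while. If you want to treat each consecutive run of a team as a separate group, so that you can get the first or last date of that run, this function does the job. Note that if you grouped by "teams" and "year" in the usual way, each team, e.g. "w", could only have one first/last date per year (for example when using `summarise` in dplyr).

Define the function: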

intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}
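
As a quick check of what the helper returns, here is a small sketch on a vector shaped like the 2013 data. The `teams` and `teams3` vectors are made up for illustration only. Note that with three or more distinct teams the index values can skip numbers, but every run still gets its own value, which is all the grouping needs.

```r
# Helper copied from the answer above
intergroup <- function(var, start = 1) {
  cumsum(abs(c(start, diff(as.numeric(as.factor(var))))))
}

# Illustrative vector: "w" for 3 days, "c" for 8 days, then "w" again for 4 days
teams <- c(rep("w", 3), rep("c", 8), rep("w", 4))
idx <- intergroup(teams)
print(idx)   # the second run of "w" gets its own index, separate from the first

# With three teams the index values are not consecutive, but still distinct per run
teams3 <- c("w", "c", "w", "k")
idx3 <- intergroup(teams3)
print(idx3)
```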

Now, first group the data by year, and then additionally group by the result of the intergroup function applied to the teams column:

library(dplyr)
df %>%
  group_by(year) %>%
  # .add = TRUE keeps the year grouping (older dplyr versions used add = TRUE)
  group_by(teamindex = intergroup(teams), .add = TRUE) %>%
  filter(dense_rank(dates) == 1)

Finally, you can filter according to your needs. Here, for example, I filtered for the minimum date. The result is:

#Source: local data frame [3 x 4]
#Groups: year, teamindex
#
#       dates teams year teamindex
#1 2013-01-01     w 2013         1
#2 2013-01-04     c 2013         2
#3 2013-01-12     w 2013         3

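Note that team "w" now appears twice, because of the "teamindex" grouping we created with the intergroup function.

Another option for the filtering step is this (using arrange and then slice):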
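
To make the difference concrete, here is a minimal base R sketch that contrasts grouping by team name with grouping by run. It does not use intergroup; the run index is built with a plain comparison of neighbouring values, which is a swapped-in alternative, and the `dates` and `teams` vectors are illustrative stand-ins for the 2013 rows.

```r
# Illustrative stand-ins for the 2013 rows
dates <- as.Date("2012-12-31") + 1:15
teams <- c(rep("w", 3), rep("c", 8), rep("w", 4))

# Grouping by team name: one first date per team, so the second "w" run is lost
by_team <- sapply(split(dates, teams), function(d) format(min(d)))
print(by_team)

# Grouping by run: a new run starts wherever a value differs from the previous one
runs <- cumsum(c(TRUE, teams[-1] != teams[-length(teams)]))
by_run <- sapply(split(dates, runs), function(d) format(min(d)))
print(by_run)
```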

df %>%
  group_by(year) %>%
  # .add = TRUE keeps the year grouping (older dplyr versions used add = TRUE)
  group_by(teamindex = intergroup(teams), .add = TRUE) %>%
  arrange(dates) %>%
  slice(1)

The data I used are from akrun's answer.

Answer 2

You can also use rle to create the teamindex.

library(dplyr)
# rle() needs an atomic vector; wrap teams in as.character() if it is a factor
df %>%
  group_by(year) %>%
  group_by(teamindex = with(rle(teams),
                            rep(seq_along(lengths), lengths)), .add = TRUE) %>%
  filter(dates == min(dates))  # or filter(dates == max(dates))

 #        dates teams year teamindex
 #1 2013-01-01     w 2013         1
 #2 2013-01-04     c 2013         2
 #3 2013-01-12     w 2013         3

Or

df %>%
  group_by(year) %>%
  group_by(teamindex = with(rle(teams),
                            rep(seq_along(lengths), lengths)), .add = TRUE) %>%
  arrange(dates) %>%
  slice(n())  # or slice(1)
 #       dates teams year teamindex
 #1 2013-01-03     w 2013         1
 #2 2013-01-11     c 2013         2
 #3 2013-01-15     w 2013         3
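
If dplyr is not available, roughly the same result can be sketched in base R. The example below is only an illustration under a few assumptions: `run_id` is a hypothetical helper name, the data are rebuilt inline to match the 2013 rows, and the rows are sorted by date before the run index is computed. It returns the start and end date of every run in one table.

```r
# Rebuild data shaped like the 2013 rows (teams kept as character)
df <- data.frame(
  dates = as.Date("2012-12-31") + 1:15,
  teams = c(rep("w", 3), rep("c", 8), rep("w", 4)),
  stringsAsFactors = FALSE
)
df$year <- as.integer(format(df$dates, "%Y"))
df <- df[order(df$dates), ]

# Hypothetical helper: number the consecutive runs of equal values using rle()
run_id <- function(x) {
  r <- rle(as.character(x))
  rep(seq_along(r$lengths), r$lengths)
}

# Compute the run index separately within each year
df$teamindex <- ave(seq_along(df$teams), df$year,
                    FUN = function(i) run_id(df$teams[i]))

# First and last row of each year/run combination (data are sorted by date)
key <- df[c("year", "teamindex")]
starts <- df[!duplicated(key), ]
ends <- df[!duplicated(key, fromLast = TRUE), ]

runs <- data.frame(year = starts$year, teams = starts$teams,
                   start = starts$dates, end = ends$dates)
print(runs)
```

Keeping both the start and the end in one table avoids running the pipeline twice, once with the minimum and once with the maximum date.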

Data

df <- structure(list(dates = structure(c(15706, 15707, 15708, 15709, 
15710, 15711, 15712, 15713, 15714, 15715, 15716, 15717, 15718, 
15719, 15720), class = "Date"), teams = c("w", "w", "w", "c", 
"c", "c", "c", "c", "c", "c", "c", "w", "w", "w", "w"), year = c(2013L, 
2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L, 
2013L, 2013L, 2013L, 2013L, 2013L)), .Names = c("dates", "teams", 
"year"), row.names = c("1", "2", "3", "4", "5", "6", "7", "8", 
"9", "10", "11", "12", "13", "14", "15"), class = "data.frame")

Subsetting a df with repeating sequences
