My data looks like this:
ID CLASS START END
100 GA 3-Jan-15 1-Feb-15
100 G 1-Feb-15 22-Feb-15
100 GA 28-Feb-15 17-Mar-15
100 G 1-Apr-15 8-Apr-15
100 G 10-Apr-15 18-Apr-15
200 FA 3-Jan-14 1-Feb-14
200 FA 1-Feb-14 22-Feb-14
200 G 28-Feb-14 15-Mar-14
200 F 1-Apr-14 20-Apr-14
Here is the data:
df <- structure(list(ID = c(100L, 100L, 100L, 100L, 100L, 200L, 200L,
200L, 200L), CLASS = structure(c(4L, 3L, 4L, 3L, 3L, 2L, 2L,
3L, 1L), .Label = c("F", "FA", "G", "GA"), class = "factor"),
START = structure(c(9L, 4L, 7L, 2L, 5L, 8L, 3L, 6L, 1L), .Label = c("1-Apr-14",
"1-Apr-15", "1-Feb-14", "1-Feb-15", "10-Apr-15", "28-Feb-14",
"28-Feb-15", "3-Jan-14", "3-Jan-15"), class = "factor"),
END = structure(c(2L, 8L, 4L, 9L, 5L, 1L, 7L, 3L, 6L), .Label = c("1-Feb-14",
"1-Feb-15", "15-Mar-14", "17-Mar-15", "18-Apr-15", "20-Apr-14",
"22-Feb-14", "22-Feb-15", "8-Apr-15"), class = "factor")), .Names = c("ID",
"CLASS", "START", "END"), class = "data.frame", row.names = c(NA,
-9L))
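I want to group the data by the ID column and then collapse any consecutive occurrences of the same value in the CLASS column (ordered by START date), keeping the minimum START date and the maximum END date. For ID 100 there is only one place where the "G" class appears consecutively, so I want to collapse those two rows into one with the min(START) and max(END) dates. This is a simple example; in the real data there are sometimes several consecutive rows that need to be collapsed.
I have tried group_by followed by some kind of ranking, but that doesn't seem to do the trick. Any suggestions on how to solve this? This is also my first time posting on SO, so I hope the question makes sense.
The result should look like this: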
ID CLASS START END
100 GA 3-Jan-15 1-Feb-15
100 G 1-Feb-15 22-Feb-15
100 GA 28-Feb-15 17-Mar-15
100 G 1-Apr-15 18-Apr-15
200 FA 3-Jan-14 22-Feb-14
200 G 28-Feb-14 15-Mar-14
200 F 1-Apr-14 20-Apr-14
Answer 0 (score: 6)
Here's one option, using data.table::rleid to make an ID for each run of the same ID and CLASS:
library(dplyr)
# make START and END Date class for easier manipulation
# (%b is locale-dependent; this assumes English month abbreviations)
df <- df %>% mutate(START = as.Date(START, '%d-%b-%y'),
END = as.Date(END, '%d-%b-%y'))
# More concise alternative (mutate_each/funs are deprecated; with dplyr >= 1.0 use across):
# df <- df %>% mutate(across(c(START, END), ~ as.Date(.x, '%d-%b-%y')))
# group and make rleid as mentioned above
df %>% group_by(ID, CLASS, rleid = data.table::rleid(ID, CLASS)) %>%
# collapse with summarise, replacing START and END with their min and max for each group
summarise(START = min(START), END = max(END)) %>%
# clean up arrangement and get rid of added rleid column
ungroup() %>% arrange(rleid) %>% select(-rleid)
# Source: local data frame [7 x 4]
#
# ID CLASS START END
# (int) (fctr) (date) (date)
# 1 100 GA 2015-01-03 2015-02-01
# 2 100 G 2015-02-01 2015-02-22
# 3 100 GA 2015-02-28 2015-03-17
# 4 100 G 2015-04-01 2015-04-18
# 5 200 FA 2014-01-03 2014-02-22
# 6 200 G 2014-02-28 2014-03-15
# 7 200 F 2014-04-01 2014-04-20
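To see what the run ID is doing, here is a small base-R sketch of the same idea. It is only an illustration, not how data.table implements rleid, and the helper name run_id is made up for this example.

```r
# Hypothetical helper: a base-R stand-in for data.table::rleid on a single vector.
# The counter goes up by one every time a value differs from the one before it.
run_id <- function(x) {
  if (length(x) == 0L) return(integer(0))
  x <- as.character(x)
  cumsum(c(TRUE, x[-1] != x[-length(x)]))
}

# CLASS values for ID 100, already in START order
cls <- c("GA", "G", "GA", "G", "G")
run_id(cls)
# [1] 1 2 3 4 4
```

The last two rows share run ID 4, which is why grouping on it collapses exactly the two consecutive "G" rows and leaves the earlier, non-adjacent "G" row alone.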
Here is a pure data.table analogue:
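library(data.table)
setDT(df)  # convert df to a data.table by reference
datecols = c("START","END")
# convert both date columns to IDate in place
df[, (datecols) := lapply(.SD, as.IDate, format = '%d-%b-%y'), .SDcols = datecols]
# first START and last END of each run (assumes rows are sorted by START within ID), then drop the helper column r
df[, .(START = START[1L], END = END[.N]), by=.(ID, CLASS, r = rleid(ID, CLASS))][, r := NULL][]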
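Both versions depend on row order: runs are defined by adjacent rows, and the data.table version takes the first START and last END of each run. If the real data might not be sorted by START within each ID, sort it first. Below is a minimal base-R sketch of the whole pipeline, assuming a small made-up data frame d (not the original df) whose date columns are already Date class.

```r
# Made-up, deliberately unsorted example data (not the original df)
d <- data.frame(
  ID    = c(100L, 100L, 100L, 200L),
  CLASS = c("G", "GA", "G", "FA"),
  START = as.Date(c("2015-04-10", "2015-02-28", "2015-04-01", "2014-01-03")),
  END   = as.Date(c("2015-04-18", "2015-03-17", "2015-04-08", "2014-02-01")),
  stringsAsFactors = FALSE
)

# 1. Sort by ID and START so that "consecutive" means consecutive in time
d <- d[order(d$ID, d$START), ]

# 2. Run ID: increments whenever ID or CLASS differs from the previous row
key <- paste(d$ID, d$CLASS)
d$run <- cumsum(c(TRUE, key[-1] != key[-length(key)]))

# 3. Collapse each run to its earliest START and latest END
res <- do.call(rbind, lapply(split(d, d$run), function(g) {
  data.frame(ID = g$ID[1], CLASS = g$CLASS[1],
             START = min(g$START), END = max(g$END),
             stringsAsFactors = FALSE)
}))
rownames(res) <- NULL
res
```

With this input the two "G" rows for ID 100 end up adjacent after sorting and collapse into a single row running from 2015-04-01 to 2015-04-18, giving three rows in total.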