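I have a data frame, and I want to filter out the entries whose dates are not consecutive. In other words, I am looking at clusters of consecutive dates.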
a %>% group_by(day) %>% summarise(count = n()) %>% mutate(day_dif = day - lag(day))
Source: local data frame [20 x 3]
day count day_dif
(date) (int) (dfft)
1 2016-02-02 12 NA days
2 2016-02-03 80 1 days
3 2016-02-04 102 1 days
4 2016-02-05 97 1 days
5 2016-02-06 118 1 days
6 2016-02-07 115 1 days
7 2016-02-08 4 1 days
8 2016-02-20 13 12 days
9 2016-02-21 136 1 days
10 2016-02-22 114 1 days
11 2016-02-23 134 1 days
12 2016-02-24 126 1 days
13 2016-02-25 128 1 days
14 2016-02-26 63 1 days
15 2016-02-27 118 1 days
16 2016-03-06 1 8 days
17 2016-03-29 28 23 days
18 2016-04-03 18 5 days
19 2016-04-08 18 5 days
20 2016-04-27 23 19 days
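Here, I want to filter out the entries whose dates are not consecutive. For example, 2016-03-06, 2016-03-29, and 2016-04-03 are single-day entries that need to be removed. I am only interested in entries that fall on several consecutive days. The ideal output I am looking for is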
day count day_dif Cluster
(date) (int) (dfft)
1 2016-02-02 12 NA days 1
2 2016-02-03 80 1 days 1
3 2016-02-04 102 1 days 1
4 2016-02-05 97 1 days 1
5 2016-02-06 118 1 days 1
6 2016-02-07 115 1 days 1
7 2016-02-08 4 1 days 1
8 2016-02-20 13 12 days 2
9 2016-02-21 136 1 days 2
10 2016-02-22 114 1 days 2
11 2016-02-23 134 1 days 2
12 2016-02-24 126 1 days 2
13 2016-02-25 128 1 days 2
14 2016-02-26 63 1 days 2
15 2016-02-27 118 1 days 2
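where the Cluster column indicates the date cluster, and the output also drops the single dates. A 1 in the Cluster column denotes the first group of dates, and a 2 the second group. If there are more than 3 consecutive days, I want to treat them as one cluster.
I tried to do this with the lag function and similar approaches, but without much success. Can someone help me with this? Any ideas would be much appreciated.
Thanks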
Answer 0: (score: 1)
We can use rle to subset the rows
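# i1 is TRUE on the first row and wherever the gap to the previous day is >= 3,
# i.e. on every row that starts a new run of dates
i1 <- c(TRUE, a1$day_dif[-1] >= 3)
# i2 flags the rows to keep: a stretch of more than 3 run starts in a row
# (isolated days) is set to FALSE, everything else to TRUE
i2 <- inverse.rle(within.list(rle(i1), {
  values1 <- values
  values[values1 & lengths > 3] <- FALSE
  values[!values1] <- TRUE
}))
a1$Cluster <- cumsum(i1)
a1[i2, ]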
# day count day_dif Cluster
#1 2016-02-02 12 NA days 1
#2 2016-02-03 80 1 days 1
#3 2016-02-04 102 1 days 1
#4 2016-02-05 97 1 days 1
#5 2016-02-06 118 1 days 1
#6 2016-02-07 115 1 days 1
#7 2016-02-08 4 1 days 1
#8 2016-02-20 13 12 days 2
#9 2016-02-21 136 1 days 2
#10 2016-02-22 114 1 days 2
#11 2016-02-23 134 1 days 2
#12 2016-02-24 126 1 days 2
#13 2016-02-25 128 1 days 2
#14 2016-02-26 63 1 days 2
#15 2016-02-27 118 1 days 2
The above code can also be chained (%>%)
a1 %>%
  mutate(i1 = c(TRUE, day_dif[-1] >= 3)) %>%
  do(data.frame(., i2 = inverse.rle(within.list(rle(.$i1), {
    values1 <- values
    values[values1 & lengths > 3] <- FALSE
    values[!values1] <- TRUE
  })))) %>%
  mutate(Cluster = cumsum(i1)) %>%
  filter(i2) %>%
  select(-i1, -i2)
# day count day_dif Cluster
#1 2016-02-02 12 NA days 1
#2 2016-02-03 80 1 days 1
#3 2016-02-04 102 1 days 1
#4 2016-02-05 97 1 days 1
#5 2016-02-06 118 1 days 1
#6 2016-02-07 115 1 days 1
#7 2016-02-08 4 1 days 1
#8 2016-02-20 13 12 days 2
#9 2016-02-21 136 1 days 2
#10 2016-02-22 114 1 days 2
#11 2016-02-23 134 1 days 2
#12 2016-02-24 126 1 days 2
#13 2016-02-25 128 1 days 2
#14 2016-02-26 63 1 days 2
#15 2016-02-27 118 1 days 2
Data:
a <- structure(list(day = structure(c(16833, 16834, 16835, 16836,
16837, 16838, 16839, 16851, 16852, 16853, 16854, 16855, 16856,
16857, 16858, 16866, 16889, 16894, 16899, 16918), class = "Date"),
count = c(12L, 80L, 102L, 97L, 118L, 115L, 4L, 13L, 136L,
114L, 134L, 126L, 128L, 63L, 118L, 1L, 28L, 18L, 18L, 23L
)), .Names = c("day", "count"), row.names = c("1", "2", "3",
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15",
"16", "17", "18", "19", "20"), class = "data.frame")
# day_dif is the gap between each day and the previous one (NA for the first row)
a1 <- a %>%
  mutate(day_dif = day - lag(day))
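To make the rle step easier to follow, here is a minimal base R sketch on a made-up vector of nine dates (the toy dates and the names gap, r, and keep are assumptions for illustration, not part of the answer). It applies the same rule as the code above without within.list, so the intermediate run lengths can be inspected.

```r
# Hypothetical toy data: four consecutive days followed by five isolated days
day <- as.Date("2016-02-02") + c(0, 1, 2, 3, 33, 56, 61, 66, 85)

# Gap in days between each date and the previous one
gap <- as.numeric(diff(day))

# TRUE on every row that starts a new run (first row, or a gap of 3 or more days)
i1 <- c(TRUE, gap >= 3)

# Run-length encode the flags: values TRUE FALSE TRUE, lengths 1 3 5
r <- rle(i1)

# Same rule as in the answer: more than 3 run starts in a row are dropped,
# rows inside a run (FALSE in i1) are kept
keep <- r
keep$values[r$values & r$lengths > 3] <- FALSE
keep$values[!r$values] <- TRUE

# Expand back to one flag per row and subset
i2 <- inverse.rle(keep)
day[i2]
```

Note that under this rule a stretch of three or fewer isolated days in a row is never set to FALSE and is therefore kept, so it is worth checking the result against your own data.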
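Answer 1: (score: 0)
There may be a better way to handle the first NA value. Here I manually set it to 0. Then, because the difference between consecutive dates is 1, you can use this property to build a boolean vector and apply cumsum to it to get the cluster ids. Finally, you can drop the groups whose length equals 1.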
# Set the first NA to 0
df[which(is.na(df), arr.ind=TRUE)] <- 0
df %>% mutate(cluster=cumsum(day_dif !=1)) %>%
group_by(cluster) %>% filter(length(cluster) > 1) %>% ungroup()
# Source: local data frame [15 x 4]
# day count day_dif cluster
# (date) (int) (dfft) (int)
# 1 2016-02-02 12 0 days 1
# 2 2016-02-03 80 1 days 1
# 3 2016-02-04 102 1 days 1
# 4 2016-02-05 97 1 days 1
# 5 2016-02-06 118 1 days 1
# 6 2016-02-07 115 1 days 1
# 7 2016-02-08 4 1 days 1
# 8 2016-02-20 13 12 days 2
# 9 2016-02-21 136 1 days 2
# 10 2016-02-22 114 1 days 2
# 11 2016-02-23 134 1 days 2
# 12 2016-02-24 126 1 days 2
# 13 2016-02-25 128 1 days 2
# 14 2016-02-26 63 1 days 2
# 15 2016-02-27 118 1 days 2
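One possible way to avoid overwriting the NA, sketched here under some assumptions: compute the gaps with diff() and prepend TRUE for the first row, so no NA is ever created. The data frame a is rebuilt from the dates and counts in the question; min_len and res are hypothetical names, and min_len = 2 only drops single days, so raise it if a cluster should need more days.

```r
library(dplyr)

# Data from the question, rebuilt as offsets from 2016-02-02
a <- data.frame(
  day = as.Date("2016-02-02") + c(0:6, 18:25, 33, 56, 61, 66, 85),
  count = c(12L, 80L, 102L, 97L, 118L, 115L, 4L, 13L, 136L, 114L,
            134L, 126L, 128L, 63L, 118L, 1L, 28L, 18L, 18L, 23L)
)

# Hypothetical threshold: minimum number of days a cluster must contain
min_len <- 2

res <- a %>%
  arrange(day) %>%
  # A new cluster starts on the first row and wherever the gap is not exactly 1 day
  mutate(Cluster = cumsum(c(TRUE, as.numeric(diff(day)) != 1))) %>%
  group_by(Cluster) %>%
  # Keep only clusters with at least min_len days
  filter(n() >= min_len) %>%
  ungroup()

res
# 15 rows remain, in clusters 1 and 2
```

If a dropped cluster sits between two kept ones, the remaining ids will have gaps; match(Cluster, unique(Cluster)) would renumber them if that matters.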