Remove non-consecutive dates from data in R

Time: 2016-06-08 15:18:15

Tags: r dplyr

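I have a data frame and I want to filter out the entries whose dates are not consecutive. In other words, I am looking at clusters of consecutive dates.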

a %>% group_by(day) %>% summarise(count = n()) %>% mutate(day_dif = day - lag(day))

Source: local data frame [20 x 3]

          day count day_dif
       (date) (int)  (dfft)
1  2016-02-02    12 NA days
2  2016-02-03    80  1 days
3  2016-02-04   102  1 days
4  2016-02-05    97  1 days
5  2016-02-06   118  1 days
6  2016-02-07   115  1 days
7  2016-02-08     4  1 days
8  2016-02-20    13 12 days
9  2016-02-21   136  1 days
10 2016-02-22   114  1 days
11 2016-02-23   134  1 days
12 2016-02-24   126  1 days
13 2016-02-25   128  1 days
14 2016-02-26    63  1 days
15 2016-02-27   118  1 days
16 2016-03-06     1  8 days
17 2016-03-29    28 23 days
18 2016-04-03    18  5 days
19 2016-04-08    18  5 days
20 2016-04-27    23 19 days

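Here, I want to filter out the entries whose dates are not consecutive. For example, 2016-03-06, 2016-03-29 and 2016-04-03 are single-day entries that need to be removed. I am only interested in entries that occur on several consecutive days. The ideal output I am looking for is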

          day count day_dif  Cluster
       (date) (int)  (dfft)
1  2016-02-02    12 NA days     1
2  2016-02-03    80  1 days     1
3  2016-02-04   102  1 days     1
4  2016-02-05    97  1 days     1
5  2016-02-06   118  1 days     1
6  2016-02-07   115  1 days     1 
7  2016-02-08     4  1 days     1
8  2016-02-20    13 12 days     2
9  2016-02-21   136  1 days     2
10 2016-02-22   114  1 days     2
11 2016-02-23   134  1 days     2
12 2016-02-24   126  1 days     2
13 2016-02-25   128  1 days     2
14 2016-02-26    63  1 days     2
15 2016-02-27   118  1 days     2

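where the Cluster column indicates the date cluster, and the single dates are dropped from the output. A 1 in the Cluster column marks the first group of dates and a 2 marks the second group. If there are more than 3 consecutive days, I want to treat them as one cluster.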

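I tried to do this with the lag function and similar tools, but without much success. Can someone help me with this? Any ideas would be much appreciated.

Thanks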

2 answers:

Answer 0: (score: 1)

We can use rle to subset the rows

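# TRUE at the first row and wherever the gap to the previous date is 3 days or more
i1 <- c(TRUE, a1$day_dif[-1] >= 3)
# In the run-length encoding of i1, set TRUE runs longer than 3 to FALSE
# (rows to drop) and every FALSE run to TRUE (rows to keep), then expand again
i2 <- inverse.rle(within.list(rle(i1), {values1 <- values
           values[values1 & lengths > 3] <- FALSE
           values[!values1] <- TRUE}))
# Number the clusters: the counter increases at every TRUE in i1
a1$Cluster <- cumsum(i1)
a1[i2,]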
#          day count day_dif Cluster
#1  2016-02-02    12 NA days       1
#2  2016-02-03    80  1 days       1
#3  2016-02-04   102  1 days       1
#4  2016-02-05    97  1 days       1
#5  2016-02-06   118  1 days       1
#6  2016-02-07   115  1 days       1
#7  2016-02-08     4  1 days       1
#8  2016-02-20    13 12 days       2
#9  2016-02-21   136  1 days       2
#10 2016-02-22   114  1 days       2
#11 2016-02-23   134  1 days       2
#12 2016-02-24   126  1 days       2
#13 2016-02-25   128  1 days       2
#14 2016-02-26    63  1 days       2
#15 2016-02-27   118  1 days       2
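
To make the rle step easier to follow, here is a minimal sketch of how rle and inverse.rle round-trip a logical vector. The vector x is made up for illustration and is not taken from the data above. It shows only the mechanics, not the full answer.

```r
# A small logical vector, made up for illustration
x <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Run-length encoding: one entry per run of identical values
r <- rle(x)
r$lengths  # 1 2 4
r$values   # TRUE FALSE TRUE

# Flip TRUE runs longer than 3 to FALSE, leaving the run lengths untouched
r$values[r$values & r$lengths > 3] <- FALSE

# Expand the modified encoding back to a vector of the original length
y <- inverse.rle(r)
y          # TRUE FALSE FALSE FALSE FALSE FALSE FALSE
```

Because only the values are changed and the lengths stay as they are, the expanded vector lines up row for row with the original and can be used directly as a row index.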

The code above can also be written as a chain (%>%):

a1 %>%
   mutate(i1 = c(TRUE, day_dif[-1] >=3))  %>%
   do(data.frame(., i2 = inverse.rle(within.list(rle(.$i1), {
                     values1 <- values
                     values[values1 & lengths >3] <- FALSE
                     values[!values1] <- TRUE
                      })))) %>%
   mutate(Cluster = cumsum(i1)) %>%
   filter(i2) %>% 
   select(-i1, -i2)
#          day count day_dif Cluster
#1  2016-02-02    12 NA days       1
#2  2016-02-03    80  1 days       1
#3  2016-02-04   102  1 days       1
#4  2016-02-05    97  1 days       1
#5  2016-02-06   118  1 days       1
#6  2016-02-07   115  1 days       1
#7  2016-02-08     4  1 days       1
#8  2016-02-20    13 12 days       2
#9  2016-02-21   136  1 days       2
#10 2016-02-22   114  1 days       2
#11 2016-02-23   134  1 days       2
#12 2016-02-24   126  1 days       2
#13 2016-02-25   128  1 days       2
#14 2016-02-26    63  1 days       2
#15 2016-02-27   118  1 days       2

Data

a <- structure(list(day = structure(c(16833, 16834, 16835, 16836, 
16837, 16838, 16839, 16851, 16852, 16853, 16854, 16855, 16856, 
16857, 16858, 16866, 16889, 16894, 16899, 16918), class = "Date"), 
count = c(12L, 80L, 102L, 97L, 118L, 115L, 4L, 13L, 136L, 
114L, 134L, 126L, 128L, 63L, 118L, 1L, 28L, 18L, 18L, 23L
)), .Names = c("day", "count"), row.names = c("1", "2", "3", 
"4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", 
"16", "17", "18", "19", "20"), class = "data.frame")

library(dplyr)

a1 <- a %>%
        mutate(day_dif = day - lag(day))

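Answer 1: (score: 0)

There may be a better way to handle the first NA value. Here I manually set it to 0. Then, since the difference between consecutive dates is 1, you can use this property to build a boolean vector and apply cumsum to it to get the cluster numbers. Finally, you can drop the groups whose length equals 1.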

# df is assumed to be the data frame from the question, including the day_dif column
# Set the first NA to 0
df[which(is.na(df), arr.ind=TRUE)] <- 0

df %>% mutate(cluster=cumsum(day_dif !=1)) %>%
  group_by(cluster) %>% filter(length(cluster) > 1) %>% ungroup()

# Source: local data frame [15 x 4]

#          day count day_dif cluster
#        (date) (int)  (dfft)   (int)
# 1  2016-02-02    12  0 days       1
# 2  2016-02-03    80  1 days       1
# 3  2016-02-04   102  1 days       1
# 4  2016-02-05    97  1 days       1
# 5  2016-02-06   118  1 days       1
# 6  2016-02-07   115  1 days       1
# 7  2016-02-08     4  1 days       1
# 8  2016-02-20    13 12 days       2
# 9  2016-02-21   136  1 days       2
# 10 2016-02-22   114  1 days       2
# 11 2016-02-23   134  1 days       2
# 12 2016-02-24   126  1 days       2
# 13 2016-02-25   128  1 days       2
# 14 2016-02-26    63  1 days       2
# 15 2016-02-27   118  1 days       2
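
One possible way to avoid patching the first NA by hand is to build the run marker from diff() and put a TRUE in front of it. The following is only a sketch under assumptions: the dates are made up for illustration, and min_len is a hypothetical threshold for the shortest run to keep, which can be raised if only longer runs should count as a cluster.

```r
library(dplyr)

# Example dates, made up for illustration: two runs and two isolated days
df <- data.frame(day = as.Date(c("2016-02-02", "2016-02-03", "2016-02-04",
                                 "2016-02-10",
                                 "2016-02-20", "2016-02-21", "2016-02-22",
                                 "2016-02-23",
                                 "2016-03-06")))

min_len <- 2  # hypothetical threshold: smallest run length to keep

res <- df %>%
  arrange(day) %>%
  # a new run starts at the first row and wherever the gap is not exactly 1 day
  mutate(run = cumsum(c(TRUE, diff(day) != 1))) %>%
  group_by(run) %>%
  # keep only runs with at least min_len rows
  filter(n() >= min_len) %>%
  ungroup() %>%
  # renumber the surviving runs 1, 2, ...
  mutate(Cluster = match(run, unique(run))) %>%
  select(-run)

res$Cluster  # 1 1 1 2 2 2 2
```

Renumbering with match() keeps the cluster ids contiguous even when an isolated date between two runs has been dropped.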