Remove rows from an R data.frame based on multiple conditions on a POSIXct column

Time: 2015-10-09 16:59:50

Tags: r posixct

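I have a data frame with multiple columns, one of which is of class POSIXct. I would like to remove rows from my data frame where the row's date/time (as determined by the POSIXct column) has a preceding date/time within the previous 24 hours, not counting the previous 3 hours.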

In Excel, I can do this easily by creating a new column like this:

=IF(COUNTIFS(datetimecolumn, "<" & currentdatetime, datetimecolumn, ">" & (currentdatetime-1), datetimecolumn, "<" & (currentdatetime-3/24)) > 0, 1, 0)

and then deleting rows accordingly.
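For reference, the COUNTIFS rule above can be transcribed almost literally into base R. This is only a sketch: the epoch seconds and the test column are copied from the dput() output further down, the time zone is set to UTC purely for reproducibility (only differences between times matter here), and the names secs, flag, and kept are made up for illustration.

```r
# Epoch seconds and expected flags copied from the dput() output in this question.
secs <- c(1343146500, 1343156400, 1343157300, 1343158200, 1343160000,
          1343163600, 1343165400, 1343520900, 1343557800, 1343680200,
          1343681100, 1343682000, 1343682900, 1343683800, 1343684700,
          1343685600, 1343686500, 1343767500, 1343952900, 1344024000,
          1344116700, 1344117600, 1344119400, 1344122100, 1344123000,
          1344123900, 1344126600, 1344172500, 1344173400)
data <- data.frame(
  datetime = as.POSIXct(secs, origin = "1970-01-01", tz = "UTC"),  # tz is an assumption
  test = c(0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
           0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L)
)

day         <- 24 * 3600  # 24 hours in seconds
three_hours <- 3 * 3600   # 3 hours in seconds

# For each row, flag 1 if some earlier row is less than 24 h old
# but more than 3 h old, mirroring the three COUNTIFS conditions.
flag <- vapply(secs, function(now) {
  as.integer(any(secs < now & secs > now - day & secs < now - three_hours))
}, integer(1))

# Keep only the rows flagged 0.
kept <- data[flag == 0L, ]
```

This compares every row against every other row, so it is quadratic in the number of rows; that is fine for a few thousand rows but is part of why a grouped approach may be preferable.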

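I can see doing this in R with a for-loop and if-statements to accomplish the same task, but I'm wondering whether there is a more concise approach, for example with data.table or dplyr. Below is a sample of the data with my Excel solution in the rightmost column, where 0 is a keeper and 1 is to be removed.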

datetime       test
7/24/2012 12:15 0 #First point, so no issues
7/24/2012 15:00 0 #Even though this point is within 24 hours of the previous point, it is less than 3 hours, so it's OK
7/24/2012 15:15 0 #Ditto for this point
7/24/2012 15:30 1 #Now this point is out of the three hour window, so it's bad
7/24/2012 16:00 1 #Ditto for this point
7/24/2012 17:00 1 #Ditto for this point
7/24/2012 17:30 1 #Ditto for this point
7/28/2012 20:15 0 #This point has no previous points within 24 hours, so OK
7/29/2012 6:30  1 #This point has a previous point within 24 hours that is also not in a previous 3 hour window, so it's bad
7/30/2012 16:30 0 #This point has no previous points within 24 hours, so OK
7/30/2012 16:45 0
7/30/2012 17:00 0
7/30/2012 17:15 0
7/30/2012 17:30 0
7/30/2012 17:45 0
7/30/2012 18:00 0
7/30/2012 18:15 0
7/31/2012 16:45 1
8/2/2012 20:15  0
8/3/2012 16:00  1
8/4/2012 17:45  0
8/4/2012 18:00  0
8/4/2012 18:30  0
8/4/2012 19:15  0
8/4/2012 19:30  0
8/4/2012 19:45  0
8/4/2012 20:30  0
8/5/2012 9:15   1
8/5/2012 9:30   1

Any help is greatly appreciated. Thanks!

Data, courtesy of @jeremycg:

data = structure(list(datetime = structure(c(1343146500, 1343156400, 
1343157300, 1343158200, 1343160000, 1343163600, 1343165400, 1343520900, 
1343557800, 1343680200, 1343681100, 1343682000, 1343682900, 1343683800, 
1343684700, 1343685600, 1343686500, 1343767500, 1343952900, 1344024000, 
1344116700, 1344117600, 1344119400, 1344122100, 1344123000, 1344123900, 
1344126600, 1344172500, 1344173400), class = c("POSIXct", "POSIXt"
), tzone = ""), test = c(0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 1L, 1L)), .Names = c("datetime", "test"), row.names = c(NA, 
-29L), class = "data.frame")

1 answer:

Answer 0 (score: 1)

I think this is what you want. First, convert your data to a proper date format:

data$datetime <- as.POSIXct(data$datetime, format = "%m/%d/%Y %R")
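If the column is still text, as in the table in the question, the conversion can be sanity-checked on a couple of values first. A small sketch, assuming strings of the form month/day/year hour:minute; tz = "UTC" and the names raw and parsed are illustrative choices, and "%H:%M" is spelled out here in place of the "%R" shorthand.

```r
# Two sample strings in the same layout as the question's table.
raw <- c("7/24/2012 12:15", "7/29/2012 6:30")

# Parse with an explicit format and time zone (UTC is an assumption).
parsed <- as.POSIXct(raw, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Print in ISO-like form to confirm month and day were not swapped.
format(parsed, "%Y-%m-%d %H:%M")
# "2012-07-24 12:15" "2012-07-29 06:30"
```

Values that do not match the format come back as NA, so checking anyNA(parsed) after converting the full column is a cheap safeguard.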

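Then we create a column that flags every time point 24 hours or more after the previous one, and cumsum it to get a grouping variable for group_by (initialgroup). Within each of these groups, we then check whether each member falls within 3 hours of the group's start.

I think your use of 0 for keep and 1 for exclude is a little confusing, since it is the opposite of R's default (i.e. as.numeric(TRUE) is 1), but I've kept it your way.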

library(dplyr)

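# diff() returns minutes for this data (the smallest gap is 15 min), hence 24*60;
# within each group, datetime - datetime[1] comes back in seconds, hence 180*60.
data %>% mutate(initialgroup = cumsum(c(24*60, diff(datetime)) >= 24*60)) %>%
         group_by(initialgroup) %>%
         mutate(ingroup = +((datetime - datetime[1]) > 180*60))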

which gives:

              datetime test initialgroup ingroup
1  2012-07-24 12:15:00    0            1       0
2  2012-07-24 15:00:00    0            1       0
3  2012-07-24 15:15:00    0            1       0
4  2012-07-24 15:30:00    1            1       1
5  2012-07-24 16:00:00    1            1       1
6  2012-07-24 17:00:00    1            1       1
7  2012-07-24 17:30:00    1            1       1
8  2012-07-28 20:15:00    0            2       0
9  2012-07-29 06:30:00    1            2       1
10 2012-07-30 16:30:00    0            3       0
11 2012-07-30 16:45:00    0            3       0
12 2012-07-30 17:00:00    0            3       0
13 2012-07-30 17:15:00    0            3       0
14 2012-07-30 17:30:00    0            3       0
15 2012-07-30 17:45:00    0            3       0
16 2012-07-30 18:00:00    0            3       0
17 2012-07-30 18:15:00    0            3       0
18 2012-07-31 16:45:00    1            3       1
19 2012-08-02 20:15:00    0            4       0
20 2012-08-03 16:00:00    1            4       1
21 2012-08-04 17:45:00    0            5       0
22 2012-08-04 18:00:00    0            5       0
23 2012-08-04 18:30:00    0            5       0
24 2012-08-04 19:15:00    0            5       0
25 2012-08-04 19:30:00    0            5       0
26 2012-08-04 19:45:00    0            5       0
27 2012-08-04 20:30:00    0            5       0
28 2012-08-05 09:15:00    1            5       1
29 2012-08-05 09:30:00    1            5       1
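The pipeline above only flags rows and relies on the units R picks automatically for time differences. A possible variant that pins the units explicitly and then drops the flagged rows might look like the following sketch. It assumes dplyr is installed; the data are rebuilt from the same epoch seconds, tz = "UTC" is chosen only for reproducibility, and gap_mins and kept are illustrative names.

```r
library(dplyr)

# Same epoch seconds and flags as the dput() output in the question.
secs <- c(1343146500, 1343156400, 1343157300, 1343158200, 1343160000,
          1343163600, 1343165400, 1343520900, 1343557800, 1343680200,
          1343681100, 1343682000, 1343682900, 1343683800, 1343684700,
          1343685600, 1343686500, 1343767500, 1343952900, 1344024000,
          1344116700, 1344117600, 1344119400, 1344122100, 1344123000,
          1344123900, 1344126600, 1344172500, 1344173400)
data <- data.frame(
  datetime = as.POSIXct(secs, origin = "1970-01-01", tz = "UTC"),  # tz is an assumption
  test = c(0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
           0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L)
)

kept <- data %>%
  # Gap to the previous row in minutes; Inf makes the first row start a group.
  mutate(gap_mins = c(Inf, as.numeric(diff(datetime), units = "mins")),
         initialgroup = cumsum(gap_mins >= 24 * 60)) %>%
  group_by(initialgroup) %>%
  # 1 if the row is more than 3 hours after the first row of its group.
  mutate(ingroup = as.integer(
    as.numeric(difftime(datetime, datetime[1], units = "secs")) > 3 * 3600)) %>%
  ungroup() %>%
  # Drop the flagged rows and the helper columns.
  filter(ingroup == 0L) %>%
  select(datetime, test)
```

With the units fixed, the thresholds no longer depend on how small the smallest gap in the data happens to be.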

Data used (after the datetime conversion):

structure(list(datetime = structure(c(1343146500, 1343156400, 
1343157300, 1343158200, 1343160000, 1343163600, 1343165400, 1343520900, 
1343557800, 1343680200, 1343681100, 1343682000, 1343682900, 1343683800, 
1343684700, 1343685600, 1343686500, 1343767500, 1343952900, 1344024000, 
1344116700, 1344117600, 1344119400, 1344122100, 1344123000, 1344123900, 
1344126600, 1344172500, 1344173400), class = c("POSIXct", "POSIXt"
), tzone = ""), test = c(0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 1L, 1L)), .Names = c("datetime", "test"), row.names = c(NA, 
-29L), class = "data.frame")