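I have a data frame with several columns, one of which is of class POSIXct. I want to delete the rows whose date/time (taken from the POSIXct column) is preceded by another date/time within the past 24 hours, not counting the most recent 3 hours.
In Excel I can do this easily by creating a new column like this: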
=IF(COUNTIFS(datetimecolumn, "<" & currentdatetime, datetimecolumn, ">" & (currentdatetime-1), datetimecolumn, "<" & (currentdatetime-3/24)) > 0, 1, 0)
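and then deleting the rows flagged with 1.
I can see how to do the same in R with a for-loop and if-statements, but I am wondering whether there is a more concise way, for example with data.table or dplyr. Below is a sample of the data with my Excel result in the rightmost column, where 0 means keep and 1 means delete.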
datetime test
7/24/2012 12:15 0 #First point, so no issues
7/24/2012 15:00 0 #Even though this point is within 24 hours of the previous point, it is less than 3 hours, so it's OK
7/24/2012 15:15 0 #Ditto for this point
7/24/2012 15:30 1 #Now this point is out of the three hour window, so it's bad
7/24/2012 16:00 1 #Ditto for this point
7/24/2012 17:00 1 #Ditto for this point
7/24/2012 17:30 1 #Ditto for this point
7/28/2012 20:15 0 #This point has no previous points within 24 hours, so OK
7/29/2012 6:30 1 #This point has a previous point within 24 hours that is also not in a previous 3 hour window, so it's bad
7/30/2012 16:30 0 #This point has no previous points within 24 hours, so OK
7/30/2012 16:45 0
7/30/2012 17:00 0
7/30/2012 17:15 0
7/30/2012 17:30 0
7/30/2012 17:45 0
7/30/2012 18:00 0
7/30/2012 18:15 0
7/31/2012 16:45 1
8/2/2012 20:15 0
8/3/2012 16:00 1
8/4/2012 17:45 0
8/4/2012 18:00 0
8/4/2012 18:30 0
8/4/2012 19:15 0
8/4/2012 19:30 0
8/4/2012 19:45 0
8/4/2012 20:30 0
8/5/2012 9:15 1
8/5/2012 9:30 1
Any help is greatly appreciated. Thanks!
Data, courtesy of @jeremycg:
data = structure(list(datetime = structure(c(1343146500, 1343156400,
1343157300, 1343158200, 1343160000, 1343163600, 1343165400, 1343520900,
1343557800, 1343680200, 1343681100, 1343682000, 1343682900, 1343683800,
1343684700, 1343685600, 1343686500, 1343767500, 1343952900, 1344024000,
1344116700, 1344117600, 1344119400, 1344122100, 1344123000, 1344123900,
1344126600, 1344172500, 1344173400), class = c("POSIXct", "POSIXt"
), tzone = ""), test = c(0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 1L, 1L)), .Names = c("datetime", "test"), row.names = c(NA,
-29L), class = "data.frame")
Answer 0 (score: 1)
I think this is what you want. First, convert your data to a proper date-time format:
data$datetime <- as.POSIXct(data$datetime, format = "%m/%d/%Y %R")
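If the column is still stored as text such as "7/24/2012 12:15", the conversion can be checked on a single value first. This is only a minimal sketch: the explicit tz = "UTC" is an assumption made here so the result does not depend on the local time zone (the posted data has an empty tzone), and %H:%M is used as the spelled-out equivalent of %R.

```r
# Parse one timestamp in the month/day/year layout used in the question.
# tz = "UTC" is an assumption for reproducibility, not part of the original data.
x <- as.POSIXct("7/24/2012 12:15", format = "%m/%d/%Y %H:%M", tz = "UTC")

# The result is a POSIXct value, stored internally as seconds since 1970-01-01 UTC.
class(x)
as.numeric(x)
```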
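Next we create a column that marks every row whose gap to the previous row is 24 hours or more, and take its cumsum to get a grouping variable for group_by (initialgroup). This assumes the rows are sorted by datetime. Within each group we then check which rows fall within 3 hours of the group's first row: those are kept (0), and later rows are flagged (1).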
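The grouping step can be illustrated on a toy vector. The gap values below are hypothetical and only serve to show how cumsum turns "a new group starts here" markers into group ids.

```r
# Hypothetical gaps (in hours) between consecutive rows; the first row has no
# predecessor, so it is treated as the start of the first group.
gap_hours <- c(NA, 2.75, 0.25, 98.75, 10.25, 34)

# TRUE marks the first row of each group: row 1, plus any gap of 24 hours or more.
starts_group <- c(TRUE, gap_hours[-1] >= 24)

# The running total of the TRUE values is the group id.
group_id <- cumsum(starts_group)
group_id
```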
I think using 0 for keep and 1 for remove may cause some confusion, because it is the opposite of R's default (as.numeric(TRUE) is 1), but I will keep it your way.
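As a small illustration of that convention, the sketch below uses made-up flags and values (remove_flag, values and keep are hypothetical names) to show how a 1 = remove column can be turned into a logical index that keeps rows.

```r
# Hypothetical flags in the question's convention: 1 = remove, 0 = keep.
remove_flag <- c(0L, 0L, 1L, 0L)
values <- c("a", "b", "c", "d")

# R coerces TRUE to 1, so a "keep" index is the comparison with 0.
as.numeric(TRUE)
keep <- remove_flag == 0

# Logical indexing keeps the elements where keep is TRUE.
values[keep]
```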
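library(dplyr)
data %>%
  # start a new group whenever the gap to the previous row is 24 hours or more
  # (as.numeric gives seconds, so the result does not depend on difftime's automatic units)
  mutate(initialgroup = cumsum(c(TRUE, diff(as.numeric(datetime)) >= 24 * 3600))) %>%
  group_by(initialgroup) %>%
  # flag rows more than 3 hours after the first row of their group (1 = remove)
  mutate(ingroup = +(as.numeric(datetime) - as.numeric(datetime[1]) > 3 * 3600))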
This gives:
              datetime test initialgroup ingroup
1  2012-07-24 12:15:00    0            1       0
2  2012-07-24 15:00:00    0            1       0
3  2012-07-24 15:15:00    0            1       0
4  2012-07-24 15:30:00    1            1       1
5  2012-07-24 16:00:00    1            1       1
6  2012-07-24 17:00:00    1            1       1
7  2012-07-24 17:30:00    1            1       1
8  2012-07-28 20:15:00    0            2       0
9  2012-07-29 06:30:00    1            2       1
10 2012-07-30 16:30:00    0            3       0
11 2012-07-30 16:45:00    0            3       0
12 2012-07-30 17:00:00    0            3       0
13 2012-07-30 17:15:00    0            3       0
14 2012-07-30 17:30:00    0            3       0
15 2012-07-30 17:45:00    0            3       0
16 2012-07-30 18:00:00    0            3       0
17 2012-07-30 18:15:00    0            3       0
18 2012-07-31 16:45:00    1            3       1
19 2012-08-02 20:15:00    0            4       0
20 2012-08-03 16:00:00    1            4       1
21 2012-08-04 17:45:00    0            5       0
22 2012-08-04 18:00:00    0            5       0
23 2012-08-04 18:30:00    0            5       0
24 2012-08-04 19:15:00    0            5       0
25 2012-08-04 19:30:00    0            5       0
26 2012-08-04 19:45:00    0            5       0
27 2012-08-04 20:30:00    0            5       0
28 2012-08-05 09:15:00    1            5       1
29 2012-08-05 09:30:00    1            5       1
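To actually remove the flagged rows, and to cross-check the result against the rule in the question's Excel formula, a base R sketch along these lines may help. The names secs, tsec, flag_direct and kept are illustrative and not from the original post; the timestamps are rebuilt from the epoch seconds in the posted data, with tz = "UTC" assumed only for reproducibility.

```r
# Epoch seconds and expected flags copied from the posted data.
secs <- c(1343146500, 1343156400, 1343157300, 1343158200, 1343160000,
          1343163600, 1343165400, 1343520900, 1343557800, 1343680200,
          1343681100, 1343682000, 1343682900, 1343683800, 1343684700,
          1343685600, 1343686500, 1343767500, 1343952900, 1344024000,
          1344116700, 1344117600, 1344119400, 1344122100, 1344123000,
          1344123900, 1344126600, 1344172500, 1344173400)
data <- data.frame(
  datetime = as.POSIXct(secs, origin = "1970-01-01", tz = "UTC"),
  test = c(0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L,
           0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L)
)

# Direct translation of the COUNTIFS rule: flag a row when some other row lies
# strictly more than 3 hours and strictly less than 24 hours before it.
tsec <- as.numeric(data$datetime)
flag_direct <- sapply(tsec, function(now) {
  age <- now - tsec                      # seconds elapsed since each row
  +any(age > 3 * 3600 & age < 24 * 3600)
})

# Compare with the expected column, then keep only the unflagged rows.
all(flag_direct == data$test)
kept <- data[flag_direct == 0, ]
nrow(kept)
```

With the dplyr result above, adding filter(ingroup == 0) to the pipeline would drop the same rows for this data. The two rules are not identical in general, though: the grouping approach measures the 3 hours from the first row of each group, while the direct rule looks at each row's own 24-hour window, so they could disagree on other data.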
Data used (after the datetime conversion):
structure(list(datetime = structure(c(1343146500, 1343156400,
1343157300, 1343158200, 1343160000, 1343163600, 1343165400, 1343520900,
1343557800, 1343680200, 1343681100, 1343682000, 1343682900, 1343683800,
1343684700, 1343685600, 1343686500, 1343767500, 1343952900, 1344024000,
1344116700, 1344117600, 1344119400, 1344122100, 1344123000, 1344123900,
1344126600, 1344172500, 1344173400), class = c("POSIXct", "POSIXt"
), tzone = ""), test = c(0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 1L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 1L, 1L)), .Names = c("datetime", "test"), row.names = c(NA,
-29L), class = "data.frame")