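R - Filter datetimes that are earlier than a set of datetimes

Time: 2017-09-13 18:47:14

Tags: r datetime dplyr lubridate

I have two data frames.

One is my data, in which I have several variable columns and several datetime-related columns (datetime, week #, date, hour, minute, second), containing daily data for 2017. For example,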

> glimpse(data)
Observations: 8,001,013
Variables: 12

$ id                 <chr> "(2, 3, 4)", "(5,)", "(6,)", "(7,)", "(8,)", "(9,)", "(10,)", "(11,)", "(12,)", "(13,)", "(14,)", "(15,)", "(16,)", "(17,)", "(18,)", "(19,)", "(20,)", "(21,...
$ x                  <int> 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1...
$ num                <chr> "set([4225])", "set([4712])", "set([5271])", "set([5334])", "set([5395])", "set([5658])", "set([5889])", "set([6020])", "set([6063])", "set([6090])", "set([6...
$ w                  <int> 4, 6, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 7, 1, 3, 2, 1, 1, 3, 2, 3, 2, 1, 1, 2, 1, 1, 4, 1, 2, 3, 1, 1, 1, 1, 3, 1, 1, 1, 2, 3, 1, 1, 4, 1, 2, 1...
$ z                  <int> 4, 6, -1, -1, 1, 1, -1, -1, -1, -1, -1, -1, -1, -2, -1, -1, -2, 2, 7, 1, -3, -2, 1, -1, 3, 2, 3, -2, -1, -1, -2, -1, -1, 4, 1, 2, -3, 1, 1, 1, 1, -3, 1, 1, 1...
$ datetime           <dttm> 2017-02-19 18:00:00, 2017-02-19 18:00:00, 2017-02-19 18:00:00, 2017-02-19 18:00:00, 2017-02-19 18:00:00, 2017-02-19 18:00:01, 2017-02-19 18:00:01, 2017-02-1...
$ date               <date> 2017-02-19, 2017-02-19, 2017-02-19, 2017-02-19, 2017-02-19, 2017-02-19, 2017-02-19, 2017-02-19, 2017-02-19, 2017-02-19, 2017-02-19, 2017-02-19, 2017-02-19, ...
$ day_of_week        <ord> Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun, Su...
$ week               <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
$ hour               <int> 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 18, 1...
$ minute             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ second             <dbl> 0.1187501, 0.3406179, 0.7030604, 0.7431633, 0.7939658, 1.0090485, 1.1624568, 1.2924566, 1.3619752, 1.3922081, 1.4920712, 1.5121725, 1.5621316, 1.6688271, 1.7...

The other data frame, key_datetimes, is just a small list of key datetimes (9 in the sample below), e.g.

> key_datetimes
# A tibble: 9 x 2
         Code         keyDateTime
        <chr>               <chr>
1       TAIL1 2017-01-12 08:30:00
2       TAIL2 2017-02-09 11:40:00
3       TAIL3 2017-03-22 08:30:01
4       TAIL4 2017-04-13 10:30:00
5       TAIL5 2017-05-19 08:30:00
6       TAIL6 2017-06-13 08:35:00
7       TAIL7 2017-07-28 09:30:00
8       TAIL8 2017-08-23 06:30:00
9       TAIL9 2017-09-13 07:30:00

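I want to mark each week's data relative to the specific datetime for that week in key_datetimes. So I want to create a new column in data called before_key_datetime that is TRUE if data$datetime < key_datetimes and FALSE otherwise.

How can I do this?

In other words, what I want to do is keep only the weeks of interest (this works), then group by week (this works), and then for each group label/mutate a new column saying which rows come before/after the keyDateTime from the second data frame (I can't get this part).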

Things I have tried:

  • Doing ifelse against a single datetime
  • Doing ifelse vector-to-vector, which obviously does not vectorize: data %>% filter(week %in% lubridate::week(as.Date(key_datetimes$keyDateTime))) %>% group_by(week) %>% filter(datetime %in% c(as.POSIXct(key_datetimes$keyDateTime)))

1 Answer:

Answer 0: (score: 1)

Not 100% sure this is what you want, but try zoo::na.locf after merging your key with your data and arranging by date.

Fill before each key date

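library(dplyr)
library(zoo)
# Combine data and key rows, sort by time, then fill Code/key backward
# so every row takes the values of the next key date.
# (A formula lambda replaces the deprecated funs() helper.)
df %>%
  full_join(key, by="Date") %>%
  arrange(Date) %>%
  mutate_at(vars(Code, key), ~ zoo::na.locf(., na.rm=FALSE, fromLast=TRUE))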

Output (head)

                   Date Code   key
1   2017-01-02 00:00:01    1 TAIL1
2   2017-01-03 00:00:01    1 TAIL1
3   2017-01-04 00:00:01    1 TAIL1
4   2017-01-05 00:00:01    1 TAIL1
5   2017-01-06 00:00:01    1 TAIL1
6   2017-01-07 00:00:01    1 TAIL1
7   2017-01-08 00:00:01    1 TAIL1
8   2017-01-09 00:00:01    1 TAIL1

Simpler example

simple <- head(df)
ans <- simple %>%
  full_join(key, by="Date") %>%
  arrange(Date)

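I left out the final mutate_at step. full_join is only used to combine the data from both data frames (without losing any rows). All data from simple and key is still present in the output. The point is to combine all the data and then sort it. This is an easy way to see the order of the entries.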

                  Date Code   key
1  2017-01-02 00:00:01   NA  <NA>    # from simple
2  2017-01-03 00:00:01   NA  <NA>    # from simple
3  2017-01-04 00:00:01   NA  <NA>    # from simple
4  2017-01-05 00:00:01   NA  <NA>    # from simple
5  2017-01-06 00:00:01   NA  <NA>    # from simple
6  2017-01-07 00:00:01   NA  <NA>    # from simple
7  2017-01-12 08:30:00    1 TAIL1    # from key
8  2017-02-09 11:40:00    2 TAIL2
9  2017-03-22 08:30:01    3 TAIL3
10 2017-04-13 10:30:00    4 TAIL4
11 2017-05-19 08:30:00    5 TAIL5
12 2017-06-13 08:35:00    6 TAIL6
13 2017-07-28 09:30:00    7 TAIL7
14 2017-08-23 06:30:00    8 TAIL8
15 2017-09-13 07:30:00    9 TAIL9

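mutate_at will fill all the NAs with the nearest non-NA value that follows them (the fill runs backward from each key row). So Code and key in rows 1-6 will take the values from row 7.

You can now use Code or key to filter the dates you are interested in, or determine whether a date in the full data frame is before a key date. For example,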
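To make the fill direction concrete, here is a minimal sketch on a bare vector. The vector `x` and the names `backward` and `forward` are made up for illustration; only `zoo::na.locf` and its `na.rm` / `fromLast` arguments come from the answer.

library(zoo)

# Toy vector: the non-NA values play the role of key rows
x <- c(NA, NA, 1, NA, 2, NA)

# fromLast = TRUE: each NA takes the next non-NA value after it
backward <- na.locf(x, na.rm = FALSE, fromLast = TRUE)
backward
# 1 1 1 2 2 NA

# fromLast = FALSE (the default): each NA takes the previous non-NA value
forward <- na.locf(x, na.rm = FALSE)
forward
# NA NA 1 1 2 2

With na.rm = FALSE, NAs that have nothing to fill from (the trailing one in `backward`, the leading two in `forward`) stay NA, which keeps the result the same length as the input.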

# Same pipeline as above, stored in ans (formula lambda instead of deprecated funs())
ans <- df %>%
  full_join(key, by="Date") %>%
  arrange(Date) %>%
  mutate_at(vars(Code, key), ~ zoo::na.locf(., na.rm=FALSE, fromLast=TRUE))

To find the dates before key[1, ] (1 TAIL1 2017-01-12 08:30:00), you can do

ans %>%
   filter(Code==1)
# The last row is from your key data frame

ans %>%
   filter(key=="TAIL1")

To determine whether a date in the data frame is before a key date

ans[3, ]
#                  Date Code   key
# 3 2017-01-04 00:00:01    1 TAIL1

This tells you that the 3rd entry in your data frame comes before key[1, ] (1 TAIL1 2017-01-12 08:30:00).
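If what you need is the per-week logical column described in the question, a more direct route might be to join the key datetimes on week and compare. This is only a sketch under assumptions: the toy `data` and `key_datetimes` below are made-up stand-ins with the question's column names, `keys` and `flagged` are hypothetical names, and it assumes at most one key datetime per week.

library(dplyr)
library(lubridate)

# Toy stand-in for the question's data (values are invented)
data <- tibble(
  datetime = ymd_hms(c("2017-01-12 08:00:00", "2017-01-12 09:00:00",
                       "2017-02-09 11:39:59", "2017-02-09 11:40:00",
                       "2017-03-01 12:00:00"))
) %>%
  mutate(week = week(datetime))

# Toy stand-in for key_datetimes, with keyDateTime stored as character
key_datetimes <- tibble(
  Code = c("TAIL1", "TAIL2"),
  keyDateTime = c("2017-01-12 08:30:00", "2017-02-09 11:40:00")
)

# Parse the key datetimes and compute their week numbers
keys <- key_datetimes %>%
  mutate(keyDateTime = ymd_hms(keyDateTime),
         week = week(keyDateTime))

# inner_join keeps only weeks that contain a key datetime;
# the comparison is then an ordinary row-wise vectorized test
flagged <- data %>%
  inner_join(keys, by = "week") %>%
  mutate(before_key_datetime = datetime < keyDateTime)

flagged

The join takes the place of both the week filter and the group_by in the question, because every remaining row carries its own week's keyDateTime. Rows from weeks without a key datetime (the March row here) are dropped.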

Just in case

Fill after each key date

# fromLast=FALSE fills forward: each row takes the values of the previous key date
df %>%
  full_join(key, by="Date") %>%
  arrange(Date) %>%
  mutate_at(vars(Code, key), ~ zoo::na.locf(., na.rm=FALSE, fromLast=FALSE))

Output (tail)

363 2017-12-21 00:00:01    9 TAIL9
364 2017-12-22 00:00:01    9 TAIL9
365 2017-12-23 00:00:01    9 TAIL9
366 2017-12-24 00:00:01    9 TAIL9
367 2017-12-25 00:00:01    9 TAIL9
368 2017-12-26 00:00:01    9 TAIL9
369 2017-12-27 00:00:01    9 TAIL9
370 2017-12-28 00:00:01    9 TAIL9
371 2017-12-29 00:00:01    9 TAIL9
372 2017-12-30 00:00:01    9 TAIL9
373 2017-12-31 00:00:01    9 TAIL9
374 2018-01-01 00:00:01    9 TAIL9

Data

library(lubridate)  # needed for ymd_hms() and days()

# One row per day: 2017-01-02 00:00:01 through 2018-01-01 00:00:01 (UTC)
df <- data.frame(Date = ymd_hms("2017-01-01 00:00:01") + days(x=1:365))

# The nine key datetimes, stored as POSIXct (seconds since the epoch, UTC)
key <- structure(list(Code = 1:9, key = c("TAIL1", "TAIL2", "TAIL3", 
"TAIL4", "TAIL5", "TAIL6", "TAIL7", "TAIL8", "TAIL9"), Date = structure(c(1484209800, 
1486640400, 1490171401, 1492079400, 1495182600, 1497342900, 1501234200, 
1503469800, 1505287800), tzone = "UTC", class = c("POSIXct", 
"POSIXt"))), class = "data.frame", .Names = c("Code", "key", 
"Date"), row.names = c(NA, -9L))
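The epoch seconds in the structure() call are hard to read, so here is a sketch of an equivalent, more readable way to build the same key table. The name `key_readable` and the `epoch` check vector are made up for illustration; the datetimes are the ones listed in the question.

library(lubridate)

# Same nine key datetimes, written out as strings and parsed as UTC
key_readable <- data.frame(
  Code = 1:9,
  key  = paste0("TAIL", 1:9),
  Date = ymd_hms(c("2017-01-12 08:30:00", "2017-02-09 11:40:00",
                   "2017-03-22 08:30:01", "2017-04-13 10:30:00",
                   "2017-05-19 08:30:00", "2017-06-13 08:35:00",
                   "2017-07-28 09:30:00", "2017-08-23 06:30:00",
                   "2017-09-13 07:30:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Check against the epoch seconds used in the structure() version
epoch <- c(1484209800, 1486640400, 1490171401, 1492079400, 1495182600,
           1497342900, 1501234200, 1503469800, 1505287800)
stopifnot(all(as.numeric(key_readable$Date) == epoch))

Parsing with an explicit tz = "UTC" matters here: full_join matches Date values by the underlying instant, so both data frames should be built in the same time zone.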