我正在处理一个多年数据集,其中包含日期列(%Y-%m-%d)和多个变量的每日值。
在R中,我如何按日期范围(即6月29日+/- 5天)对数据进行子集化,但是捕获所有年份的数据?
DATE A B C
1996-06-10 12:00:00 178.0 24.1 1.7
1996-06-11 12:00:00 184.1 30.2 1.1
1996-06-12 12:00:00 187.2 29.4 1.8
1996-06-13 12:00:00 194.4 35.0 5.3
1996-06-14 12:00:00 200.3 35.9 1.5
1996-06-15 12:00:00 138.9 15.1 0.0
...
答案 0 :(得分:2)
基础R尝试。
凯文从其他答案中窃取示例数据:
df <- data.frame(
my_date = seq.Date(as.Date("1990-01-01"), as.Date("1999-12-31"), by = 1),
x = rnorm(3652),
y = rnorm(3652),
z = rnorm(3652)
)
为选择设置变量:
month_num <- 6
day_num <- 29
bound <- 5
找出您年龄范围内的关键日期:
keydates <- as.Date(sprintf(
"%d-%02d-%02d",
do.call(seq, as.list(as.numeric(range(format(df$my_date, "%Y"))))),
month_num,
day_num
))
进行选择:
out <- df[df$my_date %in% outer(keydates, -bound:bound, `+`),]
检查它是否有效:
table(format(out$my_date, "%m-%d"))
#06-24 06-25 06-26 06-27 06-28 06-29 06-30 07-01 07-02 07-03 07-04
# 10 10 10 10 10 10 10 10 10 10 10
1990年至1999年每一天/每月的一个有效值,以“06-29”为中心,任何一方为5天
答案 1 :(得分:2)
让yrs
成为数据中的所有唯一年份,targets
为目标的月份和日期中的每一年。dates
。然后创建delta
,其中包含targets
中任意值sapply
天内的所有日期。请注意,dates
剥离了"Date"
类的%in%
,但这并不重要,因为它后来只在DF
中使用而忽略了该类。最后将DATE
子集缩小到dates
位于# inputs (also DF defined in Note at end)
target <- "06-19"
delta <- 5
DATE <- as.Date(DF$DATE)
yrs <- unique(format(DATE, "%Y"))
targets <- as.Date(paste(yrs, target, sep = "-"))
dates <- c(sapply(targets, "+", seq(-delta, delta)))
DF[DATE %in% dates, ]
的行。没有包使用。
DATE A B C
5 1996-06-14 12:00:00 200.3 35.9 1.5
6 1996-06-15 12:00:00 138.9 15.1 0.0
,并提供:
DATE
或者,可以使用单个SQL语句完成此操作。请注意,我们假设DF
列是字符,因为问题是指它采用特定格式。现在,使用相同的输入,内部选择生成每年的目标日期,然后外部选择将delta
连接到任何目标日期library(sqldf)
library(RH2)
# inputs (also DF defined in Note at end)
target <- "06-19"
delta <- 5
fn$sqldf("select DF.* from DF
join (select distinct cast(substr(DATE, 1, 4) || '-' || '$target' as DATE) as target
from DF)
on cast(substr(DATE, 1, 10) as DATE) between target - $delta and target + $delta")
天内的那些行。我们在这里使用H2数据库后端,因为它具有比SQLite更好的日期支持。
DATE A B C
1 1996-06-14 12:00:00 200.3 35.9 1.5
2 1996-06-15 12:00:00 138.9 15.1 0.0
,并提供:
DATE
如果"Date"
属于R sqldf
类,我们可以稍微简化SQL。也就是说,将上面的DF2 <- transform(DF, DATE = as.Date(DATE))
fn$sqldf("select DF2.* from DF2
join (select distinct cast(year(DATE) || '-' || '$target' as DATE) as target from DF2)
on DATE between target - $delta and target + $delta")
语句替换为:
DATE A B C
1 1996-06-14 200.3 35.9 1.5
2 1996-06-15 138.9 15.1 0.0
,并提供:
DF
输入DF <- structure(list(DATE = c("1996-06-10 12:00:00", "1996-06-11 12:00:00",
"1996-06-12 12:00:00", "1996-06-13 12:00:00", "1996-06-14 12:00:00",
"1996-06-15 12:00:00"), A = c(178, 184.1, 187.2, 194.4, 200.3,
138.9), B = c(24.1, 30.2, 29.4, 35, 35.9, 15.1), C = c(1.7, 1.1,
1.8, 5.3, 1.5, 0)), .Names = c("DATE", "A", "B", "C"), row.names = c(NA,
-6L), class = "data.frame")
假定为:
Map
答案 2 :(得分:1)
您可以使用lubridate时间间隔提供有效的日期范围,然后使用purrr地图运行数据的每个时间间隔进行过滤。
library(dplyr)
library(lubridate)
library(magrittr) # only because I've used the "exposition" (%$%) pipe
library(purrr)
df <- tibble(
my_date = as.POSIXct(
seq.Date(as.Date("1990-01-01"), as.Date("1999-12-31"), by = 1),
tz = "UTC"
),
x = rnorm(3652),
y = rnorm(3652),
z = rnorm(3652)
)
month_num <- 6
day_num <- 29
bound <- 5
date_span <- df %>%
select(my_date) %>%
filter(month(my_date) == month_num & day(my_date) == day_num) %>%
mutate(
start = my_date - days(bound),
end = my_date + days(bound)
) %$%
interval(start, end, tzone = "UTC")
map_dfr(date_span, ~filter(df, my_date %within% .x))
# # A tibble: 110 x 4
# my_date x y z
# <dttm> <dbl> <dbl> <dbl>
# 1 1990-06-24 10:00:00 0.404 1.33 1.58
# 2 1990-06-25 10:00:00 0.351 -1.73 0.665
# 3 1990-06-26 10:00:00 -0.512 1.01 1.72
# 4 1990-06-27 10:00:00 1.55 0.417 -0.126
# 5 1990-06-28 10:00:00 1.86 1.18 0.322
# 6 1990-06-29 10:00:00 -0.0193 -0.105 0.356
# 7 1990-06-30 10:00:00 0.844 -0.712 1.51
# 8 1990-07-01 10:00:00 -0.431 0.451 -2.19
# 9 1990-07-02 10:00:00 1.74 -0.0650 -0.866
# 10 1990-07-03 10:00:00 0.965 -0.506 -0.0690
# # ... with 100 more rows