I have two datasets, "Df_A" and "Df_B":
Df_A
Date Info A Info B
9/19/18 23:00 36 48
9/18/18 23:00 47 30
9/17/18 23:00 51 3
8/14/18 23:00 45 16
8/6/18 23:00 37 13
8/5/18 23:00 42 66
7/11/18 23:00 42 53
7/4/18 23:00 38 10
Df_B
Released Info Event Value X
9/6/2018 22:30 Event A 51.8
8/6/2018 22:30 Event A 52
7/5/2018 22:30 Event A 50.6
6/6/2018 22:30 Event A 54
9/2/2018 22:30 Event C 48
7/31/2018 22:30 Event C 45
9/4/2018 22:30 Event D 58.7
8/2/2018 22:30 Event D 56.2
7/3/2018 22:30 Event D 57.3
6/4/2018 22:30 Event D 51.1
5/2/2018 22:30 Event D 54.2
4/4/2018 22:30 Event D 59.8
9/3/2018 1:30 Event E 61.8
8/6/2018 1:30 Event E 63
7/2/2018 1:30 Event E 65.2
Both "Date" and "Released.Info" are factors.
I have a vector "Events" containing the events in "Df_B" that I need to process, e.g.
Events <- c("Event A", "Event D")
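For each event in "Events", I want to check whether "Date" in "Df_A" is later than "Released Info" in "Df_B". If so, I want to add the corresponding values of "Event A" and "Event D" to "Df_A".
Desired output: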
Date Info A Info B Event A Event D
9/19/18 23:00 36 48 51.8 58.7
9/18/18 23:00 47 30 51.8 58.7
9/17/18 23:00 51 3 51.8 58.7
8/14/18 23:00 45 16 52 56.2
8/6/18 23:00 37 13 52 56.2
8/5/18 23:00 42 66 50.6 56.2
7/11/18 23:00 42 53 50.6 57.3
7/4/18 23:00 38 10 54 57.3
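For example, for 9/19/18 23:00, 9/18/18 23:00 and 9/17/18 23:00 in "Df_A", the closest previous date in "Df_B" for the "Event A" group is 9/6/2018 22:30, so for those rows we take the value 51.8 from "Df_B". The same applies to every date in Df_A, and to both "Event A" and "Event D" in "Df_B".
I want to add n new columns to "Df_A": "Event A" and "Event D" in this example, but there could be more.
To do this, I have been trying to create variables dynamically for a varying number of events, like this (the events come from a CSV that is read in as a matrix):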
#To Create a variable for each Event
ListEvents <- as.list(as.vector(Events))
names(ListEvents) <- paste("Variable", 1:length(ListEvents), sep = "")
list2env(ListEvents,envir = .GlobalEnv)
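After creating a variable for each event, I was planning to write a loop that subsets Df_B for each event, compares the dates (Df_A) with the release dates (Df_B), and adds the result as a column in Df_A. But I know this is an unnecessarily complicated and inefficient approach. Can someone help me?
Answer 0 (score: 3)
The following reproduces your expected output: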
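library(tidyverse)
library(lubridate)
events <- c("Event A", "Event D")
# For each event: tag every Df_A row with it, join the matching Df_B rows,
# keep the latest release before each Date, then spread to one column per event
map(events, ~Df_A %>%
    mutate(Event = .x) %>%
    left_join(Df_B, by = "Event") %>%
    mutate(
        Date = mdy_hm(Date),
        Released.Info = mdy_hm(Released.Info)) %>%
    group_by(Date) %>%
    mutate(diff = difftime(Released.Info, Date, units = "days")) %>%
    filter(diff < 0) %>%               # releases strictly before Date
    filter(diff == max(diff)) %>%      # the closest of those
    select(-Released.Info, -diff) %>%
    spread(Event, Value.X)) %>%
    reduce(left_join) %>%              # combine the per-event tables
    arrange(desc(Date))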
## A tibble: 8 x 5
## Groups: Date [8]
# Date Info.A Info.B `Event A` `Event D`
# <dttm> <int> <int> <dbl> <dbl>
#1 2018-09-19 23:00:00 36 48 51.8 58.7
#2 2018-09-18 23:00:00 47 30 51.8 58.7
#3 2018-09-17 23:00:00 51 3 51.8 58.7
#4 2018-08-14 23:00:00 45 16 52 56.2
#5 2018-08-06 23:00:00 37 13 52 56.2
#6 2018-08-05 23:00:00 42 66 50.6 56.2
#7 2018-07-11 23:00:00 42 53 50.6 57.3
#8 2018-07-04 23:00:00 38 10 54 57.3
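The idea is to add an Event column to Df_A holding one entry of the vector events; we then left-join Df_A and Df_B and keep, for each Date, only the row whose time difference between Released.Info and Date is negative and closest to zero (that is the filter(diff < 0) and filter(diff == max(diff)) part). The rest reshapes and reorders the result to reproduce your expected output.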
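To make the two filter steps concrete, here is a minimal base R sketch for a single date. The variable names (date, released, value, nearest) and the UTC time zone are assumptions made for illustration; the values are the "Event A" rows of Df_B.

```r
# One date from Df_A and the "Event A" releases from Df_B (time zone assumed UTC)
date <- as.POSIXct("2018-08-05 23:00", tz = "UTC")
released <- as.POSIXct(c("2018-09-06 22:30", "2018-08-06 22:30",
                         "2018-07-05 22:30", "2018-06-06 22:30"), tz = "UTC")
value <- c(51.8, 52, 50.6, 54)

# Negative differences mean the release happened before the date
d <- as.numeric(difftime(released, date, units = "days"))

keep <- d < 0                                # same idea as filter(diff < 0)
nearest <- value[keep][which.max(d[keep])]   # same idea as filter(diff == max(diff))
nearest
# 50.6
```

The result matches the "Event A" value shown for 8/5/18 23:00 in the expected output.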
Sample data:
Df_A <- read.table(text =
" Date 'Info A' 'Info B'
'9/19/18 23:00' 36 48
'9/18/18 23:00' 47 30
'9/17/18 23:00' 51 3
'8/14/18 23:00' 45 16
'8/6/18 23:00' 37 13
'8/5/18 23:00' 42 66
'7/11/18 23:00' 42 53
'7/4/18 23:00' 38 10", header = TRUE)
Df_B <- read.table(text =
"'Released Info' Event 'Value X'
'9/6/2018 22:30' 'Event A' 51.8
'8/6/2018 22:30' 'Event A' 52
'7/5/2018 22:30' 'Event A' 50.6
'6/6/2018 22:30' 'Event A' 54
'9/2/2018 22:30' 'Event C' 48
'7/31/2018 22:30' 'Event C' 45
'9/4/2018 22:30' 'Event D' 58.7
'8/2/2018 22:30' 'Event D' 56.2
'7/3/2018 22:30' 'Event D' 57.3
'6/4/2018 22:30' 'Event D' 51.1
'5/2/2018 22:30' 'Event D' 54.2
'4/4/2018 22:30' 'Event D' 59.8
'9/3/2018 1:30' 'Event E' 61.8
'8/6/2018 1:30' 'Event E' 63
'7/2/2018 1:30' 'Event E' 65.2", header = TRUE)
Answer 1 (score: 1)
This can be done with a rolling join by group in data.table.
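As a rough illustration of what roll = Inf does, consider this toy example. The tables releases and queries and their columns are hypothetical and not part of the question's data; the sketch only assumes that data.table is installed.

```r
library(data.table)

# Hypothetical toy tables: three "releases" and four "query" times
releases <- data.table(time = c(1, 5, 10), value = c("a", "b", "c"))
queries  <- data.table(time = c(0, 3, 5, 12))

# For each row of queries, take the releases row with the latest time
# at or before the query time (last observation carried forward)
res <- releases[queries, on = "time", roll = Inf]
res$value
# NA  "a" "b" "c"
```

A query that precedes every release (time 0 here) gets NA, while an exact match (time 5) is joined directly.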
library(data.table)
# convert data to data.table
setDT(Df_A)
setDT(Df_B)
# convert times to POSIXct
Df_A[ , Date := as.POSIXct(Date, format = "%m/%d/%y %H:%M")]
Df_B[ , Released.Info := as.POSIXct(Released.Info, format = "%m/%d/%Y %H:%M")]
# select rows
db <- Df_B[Event %in% Events]
# rolling join: for each Event in db, join to Df_A by nearest preceding time
d2 <- db[ , .SD[Df_A, on = c(Released.Info = "Date"), roll = Inf], by = Event]
# Event Released.Info Value.X Info.A Info.B
# 1: Event A 2018-09-19 23:00:00 51.8 36 48
# 2: Event A 2018-09-18 23:00:00 51.8 47 30
# [snip]
# 7: Event A 2018-07-11 23:00:00 50.6 42 53
# 8: Event A 2018-07-04 23:00:00 54.0 38 10
# 9: Event D 2018-09-19 23:00:00 58.7 36 48
# 10: Event D 2018-09-18 23:00:00 58.7 47 30
# [snip]
# 15: Event D 2018-07-11 23:00:00 57.3 42 53
# 16: Event D 2018-07-04 23:00:00 57.3 38 10
That is basically it. If desired, cast the "Event" column to wide format and join with "Df_A":
dcast(d2[ , .(Event, Released.Info, Value.X)],
Released.Info ~ Event, value.var = "Value.X")[
Df_A, on = c(Released.Info = "Date")]
# Released.Info Event A Event D Info.A Info.B
# 1: 2018-09-19 23:00:00 51.8 58.7 36 48
# 2: 2018-09-18 23:00:00 51.8 58.7 47 30
# 3: 2018-09-17 23:00:00 51.8 58.7 51 3
# 4: 2018-08-14 23:00:00 52.0 56.2 45 16
# 5: 2018-08-06 23:00:00 52.0 56.2 37 13
# 6: 2018-08-05 23:00:00 50.6 56.2 42 66
# 7: 2018-07-11 23:00:00 50.6 57.3 42 53
# 8: 2018-07-04 23:00:00 54.0 57.3 38 10
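The long-to-wide step can also be looked at in isolation. The following is a small sketch on a hypothetical table named long, with placeholder dates "d1" and "d2"; it only assumes that data.table is installed.

```r
library(data.table)

# Hypothetical long table: one row per (Date, Event) pair
long <- data.table(
  Date  = c("d1", "d1", "d2", "d2"),
  Event = c("Event A", "Event D", "Event A", "Event D"),
  Value = c(51.8, 58.7, 52, 56.2)
)

# One row per Date, one column per distinct Event
wide <- dcast(long, Date ~ Event, value.var = "Value")
names(wide)
# "Date"    "Event A" "Event D"
```

Because the columns are derived from the values of Event, the same call should work unchanged when the Events vector contains more entries.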