我有两个data.frames,一个有事件数据,另一个有几个公司的股票数据(这里只有两个)。我希望在我的事件data.frame中为两家公司增加两个日期滞后(-1天和+1天)的列。滞后日期应该来自我的库存数据。框架(df)当然。我怎样才能做到这一点?
DATE <- c("01.01.2000","02.01.2000","03.01.2000","06.01.2000","07.01.2000","09.01.2000","10.01.2000","01.01.2000","02.01.2000","04.01.2000","06.01.2000","07.01.2000","09.01.2000","10.01.2000")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
COMP <- c("A","A","A","A","A","A","A","B","B","B","B","B","B","B")
df <- data.frame(DATE, RET, COMP)
df
# DATE RET COMP
# 1 01.01.2000 -2.00 A
# 2 02.01.2000 1.10 A
# 3 03.01.2000 3.00 A
# 4 06.01.2000 1.40 A
# 5 07.01.2000 -0.20 A
# 6 09.01.2000 0.60 A
# 7 10.01.2000 0.10 A
# 8 01.01.2000 -0.21 B
# 9 02.01.2000 -1.20 B
# 10 04.01.2000 0.90 B
# 11 06.01.2000 0.30 B
# 12 07.01.2000 -0.10 B
# 13 09.01.2000 0.30 B
# 14 10.01.2000 -0.12 B
DATE <- c("02.01.2000","03.01.2000","06.01.2000","09.01.2000","06.01.2000","07.01.2000","09.01.2000")
ARTICLE <- c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23")
COMP <- c("A","A","A","A","B","B","B")
event <- data.frame(DATE, ARTICLE, COMP)
event
# DATE ARTICLE COMP
# 1 02.01.2000 blabla11 A
# 2 03.01.2000 blabla12 A
# 3 06.01.2000 blabla13 A
# 4 09.01.2000 blabla14 A
# 5 06.01.2000 blabla21 B
# 6 07.01.2000 blabla22 B
# 7 09.01.2000 blabla23 B
输出应该是我的data.frame事件,其中有两个附加列DATEm1和DATEp1
# DATE DATEm1 DATEp1 ARTICLE COMP
# 1 02.01.2000 01.01.2000 03.01.2000 blabla11 A
# 2 03.01.2000 02.01.2000 06.01.2000 blabla12 A
# 3 06.01.2000 03.01.2000 07.01.2000 blabla13 A
# 4 09.01.2000 07.01.2000 10.01.2000 blabla14 A
# 5 06.01.2000 04.01.2000 07.01.2000 blabla21 B
# 6 07.01.2000 06.01.2000 09.01.2000 blabla22 B
# 7 09.01.2000 07.01.2000 10.01.2000 blabla23 B
我在G. Grothendieck的答案中尝试了这种方法,这对于这个例子非常有效。
问题是,我的原始data.frame包含比这个例子更多的数据,而sqldf方法相当慢并且使用了大量内存(对于我的机器而言太多)。有人有另一个解决方案吗?
答案 0 :(得分:8)
我尝试了一种使用embed
和data.table
的方法。使用提供的示例数据进行测试,它与其他data.table
方法竞争(参见下面的基准测试),但仍然有点慢。 embed
方法在延伸到额外滞后时可能会更快,但我不确定这是否相关。
无论如何,我把(截至目前)的答案放在一起,并比较时间和输出。我不知道确切的输出对你有多大影响(例如,我在基准测试中丢失了一些时间,我不得不转储RET列),但要注意不同的答案在输出格式/内容。所有方法都提供与您期望的输出格式类似的结果。
我想知道不同的方法是否针对不同大小的data.frames进行了不同的缩放...如果你测试这些,我很想知道哪个对你和你的数据最快! :)
library("data.table")
library("sqldf")
library("microbenchmark")
# ========
# = Data =
# ========
DATE <- c("01.01.2000", "02.01.2000", "03.01.2000", "06.01.2000", "07.01.2000", "09.01.2000", "10.01.2000", "01.01.2000", "02.01.2000", "04.01.2000", "06.01.2000", "07.01.2000", "09.01.2000", "10.01.2000")
RET <- c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12)
COMP <- c("A","A","A","A","A","A","A","B","B","B","B","B","B","B")
df0 <- data.frame(DATE, RET, COMP)
DATE <- c("02.01.2000","03.01.2000","06.01.2000","09.01.2000","06.01.2000","07.01.2000","09.01.2000")
ARTICLE <- c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23")
COMP <- c("A","A","A","A","B","B","B")
event0 <- data.frame(DATE, ARTICLE, COMP)
# ==================
# = rbatt function =
# ==================
# Devations from desired format:
# 1) column order (COMP is first instead of last, otherwise correct order)
m2l <- function(x) split(x, rep(1:ncol(x), each = nrow(x))) # Thanks to https://stackoverflow.com/a/6823557/2343633
e2 <- function(x, d=1) m2l(rbind(matrix(NA, ncol=d, nrow=d-1), embed(x,d)))
testRB <- function(df=df0, event=event0){
dt1 <- as.data.table(df)
dt1[,DATE:=as.character(DATE)]
dt1[,c("DATEp1","DATE","DATEm1"):=e2(DATE,3),by=COMP]
dt1[,RET:=NULL]
setkey(dt1, COMP, DATE, DATEp1, DATEm1)
dt2 <- as.data.table(event)
dt2[,DATE:=as.character(DATE)]
setkey(dt2,COMP,DATE)
# below is slightly slower than doing dt1[,RET:=NULL] then dt <- dt1[dt2]
# dt <- dt1[dt2, list(DATEp1, DATEm1, ARTICLE)] # join
dt <- dt1[dt2]
dt
}
rbatt输出:
# COMP DATE DATEp1 DATEm1 ARTICLE
#1: A 02.01.2000 03.01.2000 01.01.2000 blabla11
#2: A 03.01.2000 06.01.2000 02.01.2000 blabla12
#3: A 06.01.2000 07.01.2000 03.01.2000 blabla13
#4: A 09.01.2000 10.01.2000 07.01.2000 blabla14
#5: B 06.01.2000 07.01.2000 04.01.2000 blabla21
#6: B 07.01.2000 09.01.2000 06.01.2000 blabla22
#7: B 09.01.2000 10.01.2000 07.01.2000 blabla23
已编辑 - DA优化#1(旧代码已注释掉)
已编辑 - DA优化#2(旧代码已注释,标记为版本)
# ===========================
# = David Arenburg function =
# ===========================
# https://stackoverflow.com/a/23483865/2343633
# Devations from desired format:
# 1) column order
#2)格式DATE,DATEm1,DATEp1
testDA <- function(df=df0, event=event0){
# Original DA below:
# df$DATE <- as.Date(strptime(as.character(df$DATE), format = "%m.%d.%Y"))
# event$DATE <- as.Date(strptime(as.character(event$DATE), format = "%m.%d.%Y"))
#
# ## Making sure "df" is sorted. If your data sets are already ordered you can skip the ordering both here and in the `setDT`
# df <- df[order(df$COMP, df$DATE), ]
#
# library(data.table)
# DT <- setDT(event)[order(COMP, DATE), list(
# DATEm1 = df[match(DATE, df$DATE) - 1, "DATE"],
# DATEp1 = df[match(DATE, df$DATE) + 1, "DATE"]
# ), by = c("ARTICLE", "DATE", "COMP")]
# DT
# Optimization #1:
# event$DATE <- as.character(event$DATE) # converting event$DATE to character (if it is already a character, better to skip this part)
# tempdf <- as.character(data.table(df, key = c("COMP", "DATE"))$DATE) # sorting and conerting df$DATE to character too so they will match
# setDT(event)[order(COMP, DATE), `:=` (
# DATEm1 = tempdf[match(DATE, tempdf) - 1],
# DATEp1 = tempdf[match(DATE, tempdf) + 1]
# ), by = c("DATE", "COMP")]
# event
# Optimization #2
# library(data.table) # loading data.table pckg
tempdf <- data.table(df, key = c("COMP", "DATE"))$DATE # sorting df and taking only the dates for speed
setDT(event)[order(COMP, DATE), `:=` (
DATEm1 = tempdf[match(DATE, tempdf) - 1],
DATEp1 = tempdf[match(DATE, tempdf) + 1]
)]
event
}
David Arenburg输出:
为DA优化#1编辑(#2可能被窃听)
注意第7行中的错误内容&#34; DATEm1&#34;,月份应为04
# > testDA()
# DATE ARTICLE COMP DATEm1 DATEp1
# 1: 02.01.2000 blabla11 A 01.01.2000 03.01.2000
# 2: 03.01.2000 blabla12 A 02.01.2000 06.01.2000
# 3: 06.01.2000 blabla13 A 03.01.2000 07.01.2000
# 4: 09.01.2000 blabla14 A 07.01.2000 10.01.2000
# 5: 06.01.2000 blabla21 B 03.01.2000 07.01.2000
# 6: 07.01.2000 blabla22 B 06.01.2000 09.01.2000
# 7: 09.01.2000 blabla23 B 07.01.2000 10.01.2000
# ============================
# = G. Grothendieck function =
# ============================
# https://stackoverflow.com/a/23415033/2343633
# Deviations from desired format:
# 1) format of DATE, DATEm1, DATEp1
testGG <- function(df=df0, event=event0){
# ensure that dates sort correctly by converting to yyyy-mm-dd
df2 <- transform(df, DATE = format(as.Date(DATE, "%m.%d.%Y")))
event2 <- transform(event, DATE = format(as.Date(DATE, "%m.%d.%Y")))
result <- sqldf(c("create index i on df2(COMP, DATE)",
"select
event.DATE,
max(A.DATE) DATEm1,
min(B.DATE) DATEp1,
event.ARTICLE,
event.COMP
from event2 event, main.df2 A, main.df2 B
on event.COMP = A.COMP and event.COMP = B.COMP
and event.DATE > A.DATE and event.DATE < B.DATE
group by event.DATE, event.COMP
order by event.COMP, event.DATE"))
result
}
GG输出:
# DATE DATEm1 DATEp1 ARTICLE COMP
# 1 2000-02-01 2000-01-01 2000-03-01 blabla11 A
# 2 2000-03-01 2000-02-01 2000-06-01 blabla12 A
# 3 2000-06-01 2000-03-01 2000-07-01 blabla13 A
# 4 2000-09-01 2000-07-01 2000-10-01 blabla14 A
# 5 2000-06-01 2000-04-01 2000-07-01 blabla21 B
# 6 2000-07-01 2000-06-01 2000-09-01 blabla22 B
# 7 2000-09-01 2000-07-01 2000-10-01 blabla23 B
# =================
# = Arun function =
# =================
# https://stackoverflow.com/a/23484292/2343633
# Deviations from desired format:
# 1) Column order (COMP first, ARTICLE does not come after DATEm1)
testAR <- function(df=df0, event=event0){
dt1 = as.data.table(df)
dt2 = as.data.table(event)
key_cols = c("COMP", "DATE")
setcolorder(dt2, c(key_cols, setdiff(names(dt2), key_cols)))
setkeyv(dt1, key_cols)
idx1 = dt1[dt2, which=TRUE, mult="first"]-1L
idx2 = dt1[dt2, which=TRUE, mult="last"]+1L
idx1[idx1 == 0L] = NA
dt2[, `:=`(DATEm1 = dt1$DATE[idx1],
DATEp1 = dt1$DATE[idx2]
)]
dt2
}
Arun输出:
# COMP DATE ARTICLE DATEm1 DATEp1
# 1: A 02.01.2000 blabla11 01.01.2000 03.01.2000
# 2: A 03.01.2000 blabla12 02.01.2000 06.01.2000
# 3: A 06.01.2000 blabla13 03.01.2000 07.01.2000
# 4: A 09.01.2000 blabla14 07.01.2000 10.01.2000
# 5: B 06.01.2000 blabla21 04.01.2000 07.01.2000
# 6: B 07.01.2000 blabla22 06.01.2000 09.01.2000
# 7: B 09.01.2000 blabla23 07.01.2000 10.01.2000
编辑 - 请注意,这是原始基准(原始代码,原始OP数据集)
# =============
# = Benchmark =
# =============
microbenchmark(testAR(), testDA(), testRB(), testGG())
# Unit: milliseconds
# expr min lq median uq max neval
# testAR() 3.220278 3.414430 3.509251 3.626438 7.209494 100
# testDA() 4.273542 4.471227 4.569370 4.752857 6.460922 100
# testRB() 5.704559 5.981680 6.135946 6.457392 14.309858 100
# testGG() 22.337065 23.064494 23.964581 24.622467 50.934712 100
请注意,我从这个基准b / c中删除testGG()
它的速度要慢得多(我对几个中间数据集进行了一些测试,并且tetGG()
比其他3种方法更差)。< / p>
# ========
# = Data =
# ========
mos <- c("01","02","03","06","07","09","10", "01", "02", "04", "06", "07", "09", "10")
yrs <- 1920:2020
DATE <- paste(mos, "01", rep(yrs, each=length(mos)), sep=".")
RET <- rep(c(-2.0,1.1,3,1.4,-0.2, 0.6, 0.1, -0.21, -1.2, 0.9, 0.3, -0.1,0.3,-0.12), length(yrs))
COMP <- rep(c("A","A","A","A","A","A","A","B","B","B","B","B","B","B"), length(yrs))
df0 <- data.frame(DATE, RET, COMP)
mos2 <- c("02","03","06","09","06","07","09")
DATE <- paste(mos2, "01", rep(yrs, each=length(mos2)), sep=".")
ARTICLE <- rep(c("blabla11", "blabla12","blabla13","blabla14","blabla21","blabla22","blabla23"), length(yrs))
COMP <- rep(c("A","A","A","A","B","B","B"), length(yrs))
event0 <- data.frame(DATE, ARTICLE, COMP)
编辑 - 大型数据集的原始基准:
# > microbenchmark(testAR(), testDA(), testRB(), times=100)
# Unit: milliseconds
# expr min lq median uq max neval
# testAR() 3.458217 3.696698 3.934349 4.697033 6.584214 100
# testDA() 143.180409 148.916461 151.776002 155.219515 237.524369 100
# testRB() 7.279168 7.636102 8.073778 8.828537 11.143111 100
编辑 - DA优化后的大型数据集基准#1:
# > microbenchmark(testAR(), testDA(), testRB(), times=100)
# Unit: milliseconds
# expr min lq median uq max neval
# testAR() 3.198266 3.440739 3.605723 3.788199 22.52867 100
# testDA() 56.290346 59.528819 60.821921 64.580825 80.99480 100
# testRB() 6.763570 7.200741 7.400343 7.748849 20.97527 100
编辑 - DA优化后的大数据集基准#2:
注意 - 将更新#2发送到testDA()
的警告# > microbenchmark(testAR(), testDA(), testRB(), times=100)
# Unit: milliseconds
# expr min lq median uq max neval
# testAR() 3.423508 6.055584 6.246517 6.333444 7.653360 100
# testDA() 2.665558 3.961070 4.062354 4.139571 8.427439 100
# testRB() 6.421328 6.669137 6.877517 6.966977 8.271469 100
# There were 50 or more warnings (use warnings() to see the first 50)
# > warnings()[1]
# Warning message:
# In `[.data.table`(dt2, , `:=`(DATEm1 = dt1$DATE[idx1], ... :
# Invalid .internal.selfref detected and fixed by taking a copy of the whole table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to datatable-help so the root cause can be fixed.
分析代码
Rprof("testAR.out", memory.profiling=TRUE)
for(i in 1:50){
arAns <- testAR()
}
Rprof(NULL)
Rprof("testDA.out", memory.profiling=TRUE)
for(i in 1:50){
daAns <- testDA()
}
Rprof(NULL)
Rprof("testRB.out", memory.profiling=TRUE)
for(i in 1:50){
rbAns <- testRB()
}
Rprof(NULL)
testAR()个人资料结果
# > summaryRprof("testAR.out", memory="both")$by.self
# self.time self.pct total.time total.pct mem.total
# "[[" 0.02 10 0.06 30 8.3
# "head" 0.02 10 0.04 20 12.1
# "nrow" 0.02 10 0.04 20 10.6
# ".Call" 0.02 10 0.02 10 8.2
# ".row_names_info" 0.02 10 0.02 10 8.4
# "<Anonymous>" 0.02 10 0.02 10 8.3
# "key" 0.02 10 0.02 10 0.0
# "levels.default" 0.02 10 0.02 10 0.0
# "match" 0.02 10 0.02 10 0.0
# "stopifnot" 0.02 10 0.02 10 4.2
testDA()个人资料结果
# > summaryRprof("testDA.out", memory="both")$by.self
# self.time self.pct total.time total.pct mem.total
# "match" 2.04 26.56 2.34 30.47 94.2
# "[.data.frame" 1.78 23.18 6.50 84.64 295.3
# "NextMethod" 0.80 10.42 0.80 10.42 33.9
# "strptime" 0.42 5.47 0.46 5.99 25.9
# "[" 0.34 4.43 7.14 92.97 335.9
# "[.Date" 0.34 4.43 1.14 14.84 49.8
# "names" 0.34 4.43 0.34 4.43 17.9
# "%in%" 0.28 3.65 1.44 18.75 50.3
# "dim" 0.28 3.65 0.30 3.91 13.9
# "order" 0.16 2.08 0.18 2.34 1.7
# "$" 0.16 2.08 0.16 2.08 7.0
# ".Call" 0.14 1.82 6.76 88.02 308.4
# "length" 0.14 1.82 0.14 1.82 4.6
# "sys.call" 0.14 1.82 0.14 1.82 5.6
# "<Anonymous>" 0.04 0.52 0.04 0.52 9.5
# "as.Date.POSIXlt" 0.04 0.52 0.04 0.52 3.4
# "getwd" 0.04 0.52 0.04 0.52 9.5
# "do.call" 0.02 0.26 0.18 2.34 1.7
# "assign" 0.02 0.26 0.04 0.52 0.1
# ".subset2" 0.02 0.26 0.02 0.26 6.1
# "all" 0.02 0.26 0.02 0.26 0.0
# "file.info" 0.02 0.26 0.02 0.26 0.0
# "is.data.table" 0.02 0.26 0.02 0.26 0.0
# "lockBinding" 0.02 0.26 0.02 0.26 0.1
# "parent.frame" 0.02 0.26 0.02 0.26 0.0
# "pmatch" 0.02 0.26 0.02 0.26 0.0
# "which" 0.02 0.26 0.02 0.26 6.5
testRB()个人资料结果
# > summaryRprof("testRB.out", memory="both")$by.self
# self.time self.pct total.time total.pct mem.total
# "sort.list" 0.04 9.52 0.06 14.29 21.5
# "length" 0.04 9.52 0.04 9.52 0.0
# "pmatch" 0.04 9.52 0.04 9.52 13.9
# "[.data.table" 0.02 4.76 0.42 100.00 71.8
# ".Call" 0.02 4.76 0.12 28.57 39.6
# "split.default" 0.02 4.76 0.10 23.81 32.9
# "alloc.col" 0.02 4.76 0.08 19.05 13.3
# "[[" 0.02 4.76 0.04 9.52 6.9
# "cedta" 0.02 4.76 0.04 9.52 0.0
# "lapply" 0.02 4.76 0.04 9.52 0.0
# "[[.data.frame" 0.02 4.76 0.02 4.76 6.9
# "as.character" 0.02 4.76 0.02 4.76 6.0
# "as.name" 0.02 4.76 0.02 4.76 5.3
# "attr" 0.02 4.76 0.02 4.76 0.0
# "exists" 0.02 4.76 0.02 4.76 0.0
# "FUN" 0.02 4.76 0.02 4.76 0.0
# "intersect" 0.02 4.76 0.02 4.76 6.5
# "is.data.table" 0.02 4.76 0.02 4.76 0.0
据我所知,Arun的答案是速度最快,内存效率最高的。 rbatt回答比数据集大小更好地扩展得比DA的回答 - 我最初的猜测是使用POSIX或Date类的方法可能无法很好地扩展,但我不确定分析结果是否支持这种预感。如果有人认为这会有所帮助,我可以提供完整的个人资料结果,而不仅仅是$by.self
部分。
另外值得注意的是,花费的时间和使用的内存在方法之间是负相关的 - 最快的方法使用最少的内存。
答案 1 :(得分:4)
这可以通过sqldf中的三重连接来完成:
library(sqldf)
# ensure that dates sort correctly by converting to yyyy-mm-dd
df2 <- transform(df, DATE = format(as.Date(DATE, "%m.%d.%Y")))
event2 <- transform(event, DATE = format(as.Date(DATE, "%m.%d.%Y")))
result <- sqldf(c("create index i on df2(COMP, DATE)",
"select
event.DATE,
max(A.DATE) DATEm1,
min(B.DATE) DATEp1,
event.ARTICLE,
event.COMP
from event2 event, main.df2 A, main.df2 B
on event.COMP = A.COMP and event.COMP = B.COMP
and event.DATE > A.DATE and event.DATE < B.DATE
group by event.DATE, event.COMP
order by event.COMP, event.DATE"))
,并提供:
> result
DATE DATEm1 DATEp1 ARTICLE COMP
1 2000-02-01 2000-01-01 2000-03-01 blabla11 A
2 2000-03-01 2000-02-01 2000-06-01 blabla12 A
3 2000-06-01 2000-03-01 2000-07-01 blabla13 A
4 2000-09-01 2000-07-01 2000-10-01 blabla14 A
5 2000-06-01 2000-04-01 2000-07-01 blabla21 B
6 2000-07-01 2000-06-01 2000-09-01 blabla22 B
7 2000-09-01 2000-07-01 2000-10-01 blabla23 B
答案 2 :(得分:4)
library(data.table) # loading data.table pckg
tempdf <- data.table(df, key = c("COMP", "DATE")) # Sorting df
DATEVEC <- tempdf$DATE # Creating DATE vector to choose from
Key <- paste(DATEVEC, tempdf$COMP) # Creating Key vector for matching
setDT(event)[order(COMP, DATE), `:=`(
DATEm1 = DATEVEC[match(paste(DATE, COMP), Key) - 1],
DATEp1 = DATEVEC[match(paste(DATE, COMP), Key) + 1]
)]
event
# DATE ARTICLE COMP DATEm1 DATEp1
# 1: 02.01.2000 blabla11 A 01.01.2000 03.01.2000
# 2: 03.01.2000 blabla12 A 02.01.2000 06.01.2000
# 3: 06.01.2000 blabla13 A 03.01.2000 07.01.2000
# 4: 09.01.2000 blabla14 A 07.01.2000 10.01.2000
# 5: 06.01.2000 blabla21 B 04.01.2000 07.01.2000
# 6: 07.01.2000 blabla22 B 06.01.2000 09.01.2000
# 7: 09.01.2000 blabla23 B 07.01.2000 10.01.2000
另一种方式
tempdf <- data.table(df, key = c("COMP", "DATE")) # Sorting df
DATEVEC <- tempdf$DATE # Creating DATE vector to choose from
Keydf <- paste(DATEVEC, tempdf$COMP) # Creating Key vector for matching
event <- data.table(event, key = c("COMP", "DATE")) # Sorting event
event$Keyev <- paste(event$DATE, event$COMP) # Creating Key vector for matching
event[, `:=`(
DATEm1 = DATEVEC[match(Keyev, Keydf) - 1],
DATEp1 = DATEVEC[match(Keyev, Keydf) + 1]
)]
event
# DATE ARTICLE COMP Keyev DATEm1 DATEp1
# 1: 02.01.2000 blabla11 A 02.01.2000 A 01.01.2000 03.01.2000
# 2: 03.01.2000 blabla12 A 03.01.2000 A 02.01.2000 06.01.2000
# 3: 06.01.2000 blabla13 A 06.01.2000 A 03.01.2000 07.01.2000
# 4: 09.01.2000 blabla14 A 09.01.2000 A 07.01.2000 10.01.2000
# 5: 06.01.2000 blabla21 B 06.01.2000 B 04.01.2000 07.01.2000
# 6: 07.01.2000 blabla22 B 07.01.2000 B 06.01.2000 09.01.2000
# 7: 09.01.2000 blabla23 B 09.01.2000 B 07.01.2000 10.01.2000
第三种方式
setDT(df) # Making df adata.table
setkey(df, COMP, DATE) # Sorting df
DATEVEC <- df$DATE # Creating DATE vector to choose from
Keydf <- paste(DATEVEC, df$COMP) # Creating Key vector for matching
setDT(event) # Making event a data.table
setkey(event, COMP, DATE) # Sorting event
event$Keyev <- paste(event$DATE, event$COMP) # Creating Key vector for matching
event[, `:=`(
DATEm1 = DATEVEC[match(Keyev, Keydf) - 1],
DATEp1 = DATEVEC[match(Keyev, Keydf) + 1]
)]
event
# DATE ARTICLE COMP Keyev DATEm1 DATEp1
# 1: 02.01.2000 blabla11 A 02.01.2000 A 01.01.2000 03.01.2000
# 2: 03.01.2000 blabla12 A 03.01.2000 A 02.01.2000 06.01.2000
# 3: 06.01.2000 blabla13 A 06.01.2000 A 03.01.2000 07.01.2000
# 4: 09.01.2000 blabla14 A 09.01.2000 A 07.01.2000 10.01.2000
# 5: 06.01.2000 blabla21 B 06.01.2000 B 04.01.2000 07.01.2000
# 6: 07.01.2000 blabla22 B 07.01.2000 B 06.01.2000 09.01.2000
# 7: 09.01.2000 blabla23 B 09.01.2000 B 07.01.2000 10.01.2000
答案 3 :(得分:4)
这是使用data.table
的另一种方法:
首先,我们将df
和event
转换为data.table
s。在这里,我将使用as.data.table(.)
。但如果您不想复制,可以使用setDT
。也就是说,通过执行setDT(df)
,df
将通过引用data.table
进行修改。
require(data.table) ## >= 1.9.2
dt1 = as.data.table(df)
dt2 = as.data.table(event)
然后我们将按如下方式准备数据:
key_cols = c("COMP", "DATE")
setcolorder(dt2, c(key_cols, setdiff(names(dt2), key_cols)))
setkeyv(dt1, key_cols)
setcolorder
通过引用重新排列data.tables的列。 setkeyv
按给定列按升序对data.table进行排序,并标记dt1
的键列。
列重新排序在这里是必不可少的,因为我们没有在dt2
上设置键(因为这将排序dt2
,这对您来说可能是不受欢迎的)。由于没有为dt2
设置密钥,因此data.table会从dt2
获取第一个'n'(= 2个)列,以匹配dt1
中的键列。
注意:使用data.table的联接
x[i]
绝对需要设置x
的密钥。这里x = dt1
。i
上的设置键是可选的,具体取决于您是否希望保留订单。
现在,我们执行两个连接并获得相应的匹配索引:
idx1 = dt1[dt2, which=TRUE, mult="first"]-1L
idx2 = dt1[dt2, which=TRUE, mult="last"]+1L
dt2
中dt1
的每个匹配都会获得第一个联接,这是dt1
中的第一个匹配位置。同样,第二个联接会获得dt2
中dt1
的每个匹配,这是dt1
中的最后一个匹配位置。我们分别加上-1和+1来获得前一个和下一个索引。
照顾一个特例:
idx1[idx1 == 0L] = NA
当匹配索引为1时,减去它将导致0.由于R在0索引上的行为,我们在这里明确地用NA替换它。
现在,我们可以将这些日期进行子集化,并通过引用将其添加到dt2
,如下所示:
dt2[, `:=`(DATEm1 = dt1$DATE[idx1],
DATEp1 = dt1$DATE[idx2]
)]
# COMP DATE ARTICLE DATEm1 DATEp1
# 1: A 02.01.2000 blabla11 01.01.2000 03.01.2000
# 2: A 03.01.2000 blabla12 02.01.2000 06.01.2000
# 3: A 06.01.2000 blabla13 03.01.2000 07.01.2000
# 4: A 09.01.2000 blabla14 07.01.2000 10.01.2000
# 5: B 06.01.2000 blabla21 04.01.2000 07.01.2000
# 6: B 07.01.2000 blabla22 06.01.2000 09.01.2000
# 7: B 09.01.2000 blabla23 07.01.2000 10.01.2000