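Say I have two data.tables that I want to merge on two variables, updating an entry whenever another column (a time) in the new data is greater than the original value. It should also be a full join, so if the new data contains new entries, they should be appended. What is a good approach to this problem?
Example: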
## Initial data
dt1 <- data.table(user=c('a', 'a', 'b'),
cell=c(1, 2, 1),
expires=as.POSIXct(rep('Jan 25 21:24', 3), format='%b %d %H:%M'))
## New data to update initial
dt2 <- data.table(user=c('a', 'c'),
cell=c(1, 1),
expires=as.POSIXct(rep('Jan 25 21:59', 2), format='%b %d %H:%M'))
## Attempt
merge(dt1, dt2, by=c('user', 'cell'), all=TRUE)[
, expires := pmax(expires.x, expires.y, na.rm=TRUE)][]
## Desired result: user a in cell 1 has been updated, user c has been added
(res <- rbindlist(list(dt2, dt1[2:3,]))[order(user, cell)])
# user cell expires
# 1: a 1 2016-01-25 21:59:00
# 2: a 2 2016-01-25 21:24:00
# 3: b 1 2016-01-25 21:24:00
# 4: c 1 2016-01-25 21:59:00
Answer 0 (score: 3)
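Since it looks like you would need to run an outer join here anyway (which is usually not very memory efficient), simply running rbind should be computationally cheaper. Follow it with a simple order (which appears to use data.table's forder internally) on descending expires, combined with the unique method for data.table, so that only the latest row per user and cell is kept: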
unique(rbind(dt1, dt2)[order(-expires)], by = c("user", "cell"))
# user cell expires
# 1: a 1 2016-01-25 21:59:00
# 2: c 1 2016-01-25 21:59:00
# 3: a 2 2016-01-25 21:24:00
# 4: b 1 2016-01-25 21:24:00
This approach looks promising.
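If the row order of the desired result matters, one possible variation on the same rbind idea is to aggregate instead of deduplicating. This is only a sketch and not part of the original answer: it assumes that both tables have identical columns and that expires is the only non-key column. The explicit year and time zone are assumptions added to make the example reproducible.

```r
library(data.table)

# Same data as in the question, with an explicit year and time zone
# (both are assumptions made for reproducibility).
dt1 <- data.table(user = c("a", "a", "b"),
                  cell = c(1, 2, 1),
                  expires = as.POSIXct(rep("2016-01-25 21:24:00", 3), tz = "UTC"))
dt2 <- data.table(user = c("a", "c"),
                  cell = c(1, 1),
                  expires = as.POSIXct(rep("2016-01-25 21:59:00", 2), tz = "UTC"))

# Stack both tables, then keep the latest expiry per (user, cell).
# keyby groups and also sorts the result by the grouping columns.
res <- rbind(dt1, dt2)[, .(expires = max(expires)), keyby = .(user, cell)]
res
```

With more than one non-key column, the unique-after-order approach above is likely the better fit, since it keeps whole rows rather than a single aggregated column.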
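Answer 1 (score: 2)
From my point of view, you are already close to a solution. You only need to extend your chain so that the temporary expires.x and expires.y columns are dropped after the merge: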
require(data.table)
dt1 <- data.table(user=c('a', 'a', 'b'),
cell=c(1, 2, 1),
expires=as.POSIXct(rep(Sys.time(), 3)) )
# user cell expires
# 1: a 1 2016-01-26 11:19:49
# 2: a 2 2016-01-26 11:19:49
# 3: b 1 2016-01-26 11:19:49
## New data to update initial
dt2 <- data.table(user=c('a', 'c'),
cell=c(1, 1),
expires=as.POSIXct(rep(Sys.time(), 2)) )
# user cell expires
# 1: a 1 2016-01-26 11:20:46
# 2: c 1 2016-01-26 11:20:46
## Attempt
res_merge = merge(dt1, dt2, by=c('user', 'cell'), all=TRUE)[
, expires := pmax(expires.x, expires.y, na.rm=TRUE)][, `:=`(expires.x=NULL,expires.y=NULL)][]
# user cell expires
# 1: a 1 2016-01-26 11:20:46
# 2: a 2 2016-01-26 11:19:49
# 3: b 1 2016-01-26 11:19:49
# 4: c 1 2016-01-26 11:20:46
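As an alternative to the full merge, the same result can probably be reached with an update join followed by an anti-join. The sketch below is not from the original answer. It assumes a data.table version that supports the on= argument and the i. prefix, and it uses fixed timestamps with an assumed year and time zone instead of Sys.time() so that the result is reproducible.

```r
library(data.table)

# Fixed timestamps (year and time zone are assumptions) instead of Sys.time().
dt1 <- data.table(user = c("a", "a", "b"),
                  cell = c(1, 2, 1),
                  expires = as.POSIXct(rep("2016-01-25 21:24:00", 3), tz = "UTC"))
dt2 <- data.table(user = c("a", "c"),
                  cell = c(1, 1),
                  expires = as.POSIXct(rep("2016-01-25 21:59:00", 2), tz = "UTC"))

# Work on a copy, because := modifies a data.table by reference.
res <- copy(dt1)

# Update join: for rows of res matched by dt2 on (user, cell), keep the
# later expiry. i.expires refers to the expires column of dt2.
res[dt2, on = .(user, cell), expires := pmax(expires, i.expires)]

# Anti-join: rows of dt2 without a match in dt1 are new, so append them.
res <- rbind(res, dt2[!dt1, on = .(user, cell)])

# Sort by the join columns to match the desired output.
setorder(res, user, cell)
res
```

This avoids creating the intermediate expires.x and expires.y columns, at the cost of two join steps instead of one merge.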