Merge data.tables, keeping the most recent rows or appending new ones

Time: 2016-01-26 02:22:41

Tags: r merge data.table

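Suppose I have two data.tables that I want to merge on two variables, while updating the entries where another column (a time) is greater than in the original. It should also be a full join, so that any entries that appear only in the new data are appended. What is a good approach to this problem?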

Example:

## Initial data
dt1 <- data.table(user=c('a', 'a', 'b'), 
                   cell=c(1, 2, 1),
                   expires=as.POSIXct(rep('Jan 25 21:24', 3), format='%b %d %H:%M'))

## New data to update initial
dt2 <- data.table(user=c('a', 'c'), 
                 cell=c(1, 1),
                 expires=as.POSIXct(rep('Jan 25 21:59', 2), format='%b %d %H:%M'))

## Attempt
merge(dt1, dt2, by=c('user', 'cell'), all=TRUE)[
  , expires := pmax(expires.x, expires.y, na.rm=TRUE)][]

## Desired result: user a in cell 1 has been updated, user c has been added
(res <- rbindlist(list(dt2, dt1[2:3,]))[order(user, cell)])
#    user cell             expires
# 1:    a    1 2016-01-25 21:59:00
# 2:    a    2 2016-01-25 21:24:00
# 3:    b    1 2016-01-25 21:24:00
# 4:    c    1 2016-01-25 21:59:00

2 Answers:

Answer 0: (score: 3)

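Since it looks like you will need to run an outer join here anyway (which usually isn't very memory efficient), just running rbind should be computationally cheaper. Following that with a simple unique by group combined with order (which seems to use data.table's forder method) looks promising:

unique(rbind(dt1, dt2)[order(-expires)], by = c("user", "cell"))
#    user cell             expires
# 1:    a    1 2016-01-25 21:59:00
# 2:    c    1 2016-01-25 21:59:00
# 3:    a    2 2016-01-25 21:24:00
# 4:    b    1 2016-01-25 21:24:00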
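The result of this approach comes back ordered by expiry rather than by user and cell. A minimal sketch of restoring the key order with setorder follows; the explicit year and the UTC time zone are assumptions added here so the example does not depend on the current date or locale.

```r
library(data.table)

# Same data as in the question. The fixed year and tz = "UTC" are
# assumptions made for reproducibility, not part of the original post.
dt1 <- data.table(user = c("a", "a", "b"),
                  cell = c(1, 2, 1),
                  expires = as.POSIXct(rep("2016-01-25 21:24:00", 3), tz = "UTC"))
dt2 <- data.table(user = c("a", "c"),
                  cell = c(1, 1),
                  expires = as.POSIXct(rep("2016-01-25 21:59:00", 2), tz = "UTC"))

# Stack both tables, put the latest expiry first, then keep the first
# (most recent) row for each user/cell combination.
res <- unique(rbind(dt1, dt2)[order(-expires)], by = c("user", "cell"))

# Sort by the key columns, by reference, to match the desired layout.
setorder(res, user, cell)
res
```

Because unique keeps the first row it sees per group, sorting by descending expiry beforehand is what makes the most recent row win.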

Answer 1: (score: 2)

From my point of view you are already close to a solution; you just need to extend your chain as follows:

require(data.table)
dt1 <- data.table(user=c('a', 'a', 'b'), 
                  cell=c(1, 2, 1),
                  expires=as.POSIXct(rep(Sys.time(), 3)) )
#    user cell             expires
# 1:    a    1 2016-01-26 11:19:49
# 2:    a    2 2016-01-26 11:19:49
# 3:    b    1 2016-01-26 11:19:49


## New data to update initial
dt2 <- data.table(user=c('a', 'c'), 
                  cell=c(1, 1),
                  expires=as.POSIXct(rep(Sys.time(), 2)) )
#    user cell             expires
# 1:    a    1 2016-01-26 11:20:46
# 2:    c    1 2016-01-26 11:20:46

## Attempt
res_merge = merge(dt1, dt2, by=c('user', 'cell'), all=TRUE)[
  , expires := pmax(expires.x, expires.y, na.rm=TRUE)][, `:=`(expires.x=NULL,expires.y=NULL)][]

#    user cell             expires
# 1:    a    1 2016-01-26 11:20:46
# 2:    a    2 2016-01-26 11:19:49
# 3:    b    1 2016-01-26 11:19:49
# 4:    c    1 2016-01-26 11:20:46
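As a sanity check, the merge-based chain can be compared with the rbind/unique approach from the other answer. This is only a sketch: the fixed timestamps and the UTC time zone are assumptions that replace Sys.time() so the comparison is reproducible, and res_unique is a name introduced here for illustration.

```r
library(data.table)

# Fixed timestamps (an assumption) instead of Sys.time().
dt1 <- data.table(user = c("a", "a", "b"),
                  cell = c(1, 2, 1),
                  expires = as.POSIXct(rep("2016-01-26 11:19:49", 3), tz = "UTC"))
dt2 <- data.table(user = c("a", "c"),
                  cell = c(1, 1),
                  expires = as.POSIXct(rep("2016-01-26 11:20:46", 2), tz = "UTC"))

# Full outer join, take the later of the two expiry columns,
# then drop the helper columns created by merge().
res_merge <- merge(dt1, dt2, by = c("user", "cell"), all = TRUE)[
  , expires := pmax(expires.x, expires.y, na.rm = TRUE)][
  , `:=`(expires.x = NULL, expires.y = NULL)][]

# rbind/unique variant, sorted by the key columns for comparison.
res_unique <- unique(rbind(dt1, dt2)[order(-expires)], by = c("user", "cell"))
setorder(res_unique, user, cell)

# Compare contents only; merge() sets a key on its result, so
# attributes are ignored here.
all.equal(res_merge, res_unique, check.attributes = FALSE)
```

Both variants keep one row per user/cell pair with the latest expiry; they differ mainly in whether the intermediate step is a join or a stacked table.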