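I have some large datasets that I am trying to combine with data.table, summing the shared column on matching rows. I know how to merge on matching rows in the LHS data.table using [, as shown below with tables a2 (LHS) and a (RHS):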
a2 <- data.table( b= c(letters[1:5],letters[11:15]), c = as.integer(rep(100,10)))
a <- data.table(b = letters[1:10], c = as.integer(1:10))
setkey(a2 ,"b")
setkey(a , "b")
a2
b c
1: a 100
2: b 100
3: c 100
4: d 100
5: e 100
6: k 100
7: l 100
8: m 100
9: n 100
10: o 100
a
b c
1: a 1
2: b 2
3: c 3
4: d 4
5: e 5
6: f 6
7: g 7
8: h 8
9: i 9
10: j 10
From the second answer to "Merge data frames whilst summing common columns in R" I saw how to sum the column on matching rows, as follows:
setkey(a , "b")
setkey(a2, "b")
a2[a, `:=`(c = c + i.c)]
a2
b c
1: a 101
2: b 102
3: c 103
4: d 104
5: e 105
6: k 100
7: l 100
8: m 100
9: n 100
10: o 100
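However, I am also trying to keep the rows that do not match.
Alternatively, I could use merge as shown below, but I would like to avoid creating a new table with 4 columns before reducing it to 2 columns.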
c <- merge(a, a2, by = "b", all = TRUE)                     # full outer join on b
c <- transform(c, value = rowSums(c[, 2:3], na.rm = TRUE))  # add c.x and c.y, ignoring NA
c <- c[, c(1, 4)]                                           # keep only b and value
c
b value
1: a 102
2: b 104
3: c 106
4: d 108
5: e 110
6: f 6
7: g 7
8: h 8
9: i 9
10: j 10
11: k 100
12: l 100
13: m 100
14: n 100
15: o 100
This is the final table I am trying to achieve. Thanks in advance.
Answer 0 (score: 2)
merge is probably not efficient here. Since your two data.tables have the same structure, I would suggest rbinding them together and summing by their key. In other words:
rbindlist(list(a, a2))[, sum(c), b]
I've used rbindlist because it is generally more efficient at rbinding data.tables (even though you first have to put the data.tables in a list).
Comparing some timings on larger datasets:
library(data.table)
library(stringi)
set.seed(1)
n <- 1e7; n2 <- 1e6
x <- stri_rand_strings(n, 4)
a2 <- data.table(b = sample(x, n2), c = sample(100, n2, TRUE))
a <- data.table(b = sample(x, n2), c = sample(10, n2, TRUE))
system.time(rbindlist(list(a, a2))[, sum(c), b])
# user system elapsed
# 0.83 0.05 0.87
system.time(merge(a2, a, by = "b", all = TRUE)[, rowSums(.SD, na.rm = TRUE), b]) # Get some coffee
# user system elapsed
# 159.58 0.48 162.95
## Do we have all the rows we expect to have?
length(unique(c(a$b, a2$b)))
# [1] 1782166
nrow(rbindlist(list(a, a2))[, sum(c), b])
# [1] 1782166
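For reference, here is a minimal sketch of the same rbindlist approach applied to the small tables from the question. The column name value and the use of keyby (to get the result sorted by b) are my own choices for illustration, not something the question requires:

```r
library(data.table)

# Rebuild the small example tables from the question as fresh copies,
# so the earlier by-reference `:=` update of a2 is not included.
a2 <- data.table(b = c(letters[1:5], letters[11:15]), c = rep(100L, 10))
a  <- data.table(b = letters[1:10], c = 1:10)

# Stack both tables, then sum c within each value of b.
# `value` is an illustrative column name; keyby sorts the result by b.
res <- rbindlist(list(a, a2))[, .(value = sum(c)), keyby = b]
print(res)
```

On fresh copies this gives 15 rows, with 101 to 105 for a to e, 6 to 10 for f to j, and 100 for k to o. The merge output in the question shows 102 to 110 for a to e, which looks like the result of a2 having already been updated by reference with `:=` before merge was called.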