I have policyData, a very large dataset (millions of rows), and I want to add some information to it from a mapping table (tens of thousands of rows).
Example:
policyData <- data.table(plan=c("c","b","b","d"),v=c(8,7,5,6),foo=c(4,2,8,3))
mapping <- data.table(plan=c("b","b","a","a","c","c"),a=c(1,2,4,5,7,8),b=c(9,8,6,5,3,2))
policyData:
plan v foo
1: c 8 4
2: b 7 2
3: b 5 8
4: d 6 3
mapping:
plan a b
1: b 1 9
2: b 2 8
3: a 4 6
4: a 5 5
5: c 7 3
6: c 8 2
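The problem is that the mapping table contains several rows per plan, and I want only the first match. I also need to use := so that the two tables are combined in a memory-efficient way.
The desired output is: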
plan v foo a b
1: c 8 4 7 3
2: b 7 2 1 9
3: b 5 8 1 9
4: d 6 3 NA NA
I tried:
policyData[mapping, on="plan", `:=`(a=i.a, b=i.b)]
which gives the last matching row from the mapping table:
plan v foo a b
1: c 8 4 8 2
2: b 7 2 2 8
3: b 5 8 2 8
4: d 6 3 NA NA
I also tried:
policyData[mapping, on="plan", `:=`(a=i.a, b=i.b), mult="first"]
which gives a strange result (the second "b" row is not matched against the mapping at all):
plan v foo a b
1: c 8 4 8 2
2: b 7 2 2 8
3: b 5 8 NA NA
4: d 6 3 NA NA
Any insight would be helpful. I have already searched extensively.
Answer 0 (score: 5)
Just reduce mapping to its first row per plan with mapping[, .SD[1], by = plan], and then use the result in the join:
policyData[mapping[, .SD[1], by = plan]
, on = .(plan)
, `:=` (a = i.a, b = i.b)]
which gives the desired output:
> policyData
plan v foo a b
1: c 8 4 7 3
2: b 7 2 1 9
3: b 5 8 1 9
4: d 6 3 NA NA
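For anyone who wants to check this on the question's sample data, here is a minimal, self-contained sketch of the same idea. The intermediate name firstMap is only illustrative and does not appear in the question.

```r
library(data.table)

# Sample data copied from the question.
policyData <- data.table(plan = c("c", "b", "b", "d"),
                         v = c(8, 7, 5, 6), foo = c(4, 2, 8, 3))
mapping <- data.table(plan = c("b", "b", "a", "a", "c", "c"),
                      a = c(1, 2, 4, 5, 7, 8), b = c(9, 8, 6, 5, 3, 2))

# Keep only the first row of each plan, so every plan appears at most once.
# (firstMap is a hypothetical helper name used for this sketch.)
firstMap <- mapping[, .SD[1], by = plan]

# Update join: each policyData row now has at most one matching row in
# firstMap, so := cannot overwrite a first match with a later one.
policyData[firstMap, on = .(plan), `:=`(a = i.a, b = i.b)]

print(policyData)
```

Because the deduplication happens on the small mapping table, the extra copy should be cheap compared with the size of policyData.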
Answer 1 (score: 2)
Another option:
policyData[, c("a", "b") := mapping[.SD, on="plan", .(a, b), mult="first"]]
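A short sketch of why reversing the join helps, as I understand it: in x[i, mult="first"], mult chooses among the rows of x that match each row of i. With policyData as x (the attempt in the question), each mapping row updates only the first matching policy row, which would explain the unmatched second "b". With mapping as x, each policyData row gets the first matching mapping row. The name lookup below is only illustrative.

```r
library(data.table)

# Sample data copied from the question.
policyData <- data.table(plan = c("c", "b", "b", "d"),
                         v = c(8, 7, 5, 6), foo = c(4, 2, 8, 3))
mapping <- data.table(plan = c("b", "b", "a", "a", "c", "c"),
                      a = c(1, 2, 4, 5, 7, 8), b = c(9, 8, 6, 5, 3, 2))

# Look up every policyData row in mapping. mult = "first" keeps only the
# first matching mapping row, and unmatched plans ("d") come back as NA,
# so the result has exactly one row per policyData row, in the same order.
lookup <- mapping[policyData, .(a, b), on = "plan", mult = "first"]
print(lookup)

# Because the rows line up one to one, the columns can be assigned directly.
policyData[, c("a", "b") := lookup]
print(policyData)
```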
Sample data sized to match the OP's dimensions:
library(data.table)
set.seed(0L)
nrDS <- 100e6
nrMap <- 90e3
policyData <- data.table(plan=sample(letters,nrDS,TRUE),v=rnorm(nrDS),foo=rnorm(nrDS))
mapping <- data.table(plan=sample(letters,nrMap,TRUE),a=rnorm(nrMap),b=rnorm(nrMap))
Timing and memory profile:
library(bench)
mark(mtd1=policyData[mapping[mapping[, .I[1L], by = plan]$V1], on = .(plan), `:=` (a = i.a, b = i.b)],
mtd2=policyData[, c("a", "b") := mapping[.SD, on="plan", .(a, b), mult="first"]],
mtd3=policyData[unique(mapping, by="plan"), on=.(plan), `:=` (a=i.a, b=i.b)])
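Both mtd1 and mtd3 first reduce mapping to one row per plan, as answer 0 does with .SD[1]. The sketch below compares the three ways of doing that on the question's small mapping table; the names idx, viaI, viaUnique and viaSD are only illustrative.

```r
library(data.table)

# mapping table copied from the question.
mapping <- data.table(plan = c("b", "b", "a", "a", "c", "c"),
                      a = c(1, 2, 4, 5, 7, 8), b = c(9, 8, 6, 5, 3, 2))

# mtd1: .I holds row numbers, so .I[1L] by plan is the row number of the
# first row in each group; subsetting by those numbers keeps those rows.
idx <- mapping[, .I[1L], by = plan]$V1
viaI <- mapping[idx]

# mtd3: unique() with by = keeps the first row for each distinct plan.
viaUnique <- unique(mapping, by = "plan")

# Answer 0: take the first row of each group's .SD.
viaSD <- mapping[, .SD[1], by = plan]

print(idx)
print(viaI)
print(viaUnique)
print(viaSD)
```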
Output:
# A tibble: 3 x 14
expression min mean median max `itr/sec` mem_alloc n_gc n_itr total_time result memory time gc
<chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <bch:tm> <list> <list> <list> <list>
1 mtd1 7.07s 7.07s 7.07s 7.07s 0.141 4.74GB 0 1 7.07s <data.table [90,000,000 x 5~ <Rprofmem [31,589 x 3]> <bch:t~ <tibble [1 x 3~
2 mtd2 6.73s 6.73s 6.73s 6.73s 0.149 5.03GB 1 1 6.73s <data.table [90,000,000 x 5~ <Rprofmem [20 x 3]> <bch:t~ <tibble [1 x 3~
3 mtd3 7.68s 7.68s 7.68s 7.68s 0.130 3.35GB 1 1 7.68s <data.table [90,000,000 x 5~ <Rprofmem [23 x 3]> <bch:t~ <tibble [1 x 3~
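Hugh's method (mtd3, which uses unique(mapping, by="plan")) is the most memory-efficient at 3.35GB allocated, while mtd2 is the fastest at 6.73s, although each method was run only once here. As with most things in life, you will need to make a trade-off.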
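Before choosing on speed or memory alone, it may be worth confirming that the three methods agree. Below is a minimal sketch on a scaled-down dataset; the sizes 1e4 and 1e2 and the names template, r1, r2 and r3 are assumptions made for this check, not part of the benchmark above. Since := updates by reference, each method gets its own copy of the data.

```r
library(data.table)
set.seed(0L)

# Assumption: much smaller sizes than in the benchmark, so this runs quickly.
nrDS <- 1e4
nrMap <- 1e2
template <- data.table(plan = sample(letters, nrDS, TRUE),
                       v = rnorm(nrDS), foo = rnorm(nrDS))
mapping <- data.table(plan = sample(letters, nrMap, TRUE),
                      a = rnorm(nrMap), b = rnorm(nrMap))

# Each method works on a fresh copy, because := modifies its table in place.
r1 <- copy(template)[mapping[mapping[, .I[1L], by = plan]$V1],
                     on = .(plan), `:=`(a = i.a, b = i.b)]
r2 <- copy(template)[, c("a", "b") := mapping[.SD, on = "plan", .(a, b), mult = "first"]]
r3 <- copy(template)[unique(mapping, by = "plan"),
                     on = .(plan), `:=`(a = i.a, b = i.b)]

# All three should attach the same first-match values to every row.
stopifnot(identical(r1$a, r2$a), identical(r1$b, r2$b),
          identical(r2$a, r3$a), identical(r2$b, r3$b))
```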