I have a very large data.table (1e7 rows by 50 columns). Apart from the key columns, every column is logical. Here is a mini version:
id time drugA drugB
1 1 TRUE FALSE
1 2 TRUE FALSE
1 3 FALSE FALSE
1 4 TRUE FALSE
2 1 FALSE TRUE
2 2 FALSE FALSE
2 3 FALSE FALSE
2 4 FALSE FALSE
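For anyone who wants to run the code in this thread, the mini table above can be built roughly as follows. This is only a sketch: the name `d.in` is taken from the answers below, and keying on `id` and `time` is an assumption based on the description of the key columns.

```r
library(data.table)

# Mini version of the table; `d.in` is the name used by the answers below.
d.in <- data.table(
  id    = rep(1:2, each = 4),
  time  = rep(1:4, times = 2),
  drugA = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE),
  drugB = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Assumption: id and time are the key columns mentioned in the question.
setkey(d.in, id, time)
print(d.in)
```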
The mapping I want to apply looks like this:
kv <- c(drugA=1, drugB=2)
That is my key-value 'dictionary'. The output I am after is:
id time drugA drugB
1 1 1 NA
1 2 1 NA
1 3 NA NA
1 4 1 NA
2 1 NA 2
2 2 NA NA
2 3 NA NA
2 4 NA NA
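I want to use this dictionary to replace the TRUE values in each logical column with the matching dictionary value (and everything else with NA), giving the output shown above. The benchmark data and the solutions I have tried are: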
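library(data.table)
library(microbenchmark)
d.orig <- data.table(
  id = rep(1:2, each = 1e7),
  time = rep(1:1e7, times = 2),  # was c(1e7, 1e7); id + time must be unique for dcast
  drugA = sample(c(TRUE, FALSE), 2e7, replace = TRUE),
  drugB = sample(c(TRUE, FALSE), 2e7, replace = TRUE)
)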
# Solution 1
foo1 <- function() {
  d.in <- data.table::copy(d.orig)
  d.in[, names(kv) := lapply(names(kv), function(x) {
    gx <- get(x)
    replace(NA_real_[seq_along(gx)], gx, kv[x])
  })]
}
# Solution 2
dt_kv <- data.table(drug = c("drugA","drugB"), value = c(1,2))
foo2 <- function() {
  d.in <- data.table::copy(d.orig)
  d.in <- melt(d.in, id.vars = c("id", "time"))[
    dt_kv, on = c(variable = "drug"), nomatch = 0][
    value == FALSE, i.value := NA]
  dcast(d.in, formula = id + time ~ variable, value.var = "i.value")
}
# Solution 3
kDT = data.table(variable = names(kv), value = TRUE, v = unname(kv))
foo3 <- function() {
  d.in <- data.table::copy(d.orig)
  DT = melt(d.in, id=c("id","time"))
  DT[kDT, on=.(variable, value), v := i.v ]
  dcast(DT, formula = id + time ~ variable, value.var = 'v')
}
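What is the most efficient (fastest) way to do this?
I have tried the solutions above but could not find a significant difference between them (although I am not sure my comparison method is valid). The timings are: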
print(microbenchmark(foo1, times=1e4))
Unit: nanoseconds
expr min lq mean median uq max neval
foo1 33 50 85.8657 55 58 56717 10000
print(microbenchmark(foo2, times=1e4))
Unit: nanoseconds
expr min lq mean median uq max neval
foo2 29 48 70.8304 52 55 57644 10000
print(microbenchmark(foo3, times=1e4))
Unit: nanoseconds
expr min lq mean median uq max neval
foo3 30 36 61.1542 41 48 58015 10000
(These timings still vary a lot between runs.)
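One likely reason the timings are in nanoseconds and nearly identical is that `microbenchmark(foo1, ...)` only evaluates the symbol `foo1`, so the function is never called. A minimal sketch of timing the call itself follows. The table size `n` is a hypothetical, much smaller value than in the question so the example finishes quickly, and only `foo1` is repeated here to keep it short.

```r
library(data.table)
library(microbenchmark)

kv <- c(drugA = 1, drugB = 2)

set.seed(1)
n <- 1e5  # hypothetical size, far smaller than the question's 1e7
d.orig <- data.table(
  id    = rep(1:2, each = n),
  time  = rep(seq_len(n), times = 2),
  drugA = sample(c(TRUE, FALSE), 2 * n, replace = TRUE),
  drugB = sample(c(TRUE, FALSE), 2 * n, replace = TRUE)
)

# Solution 1 from the question, returning the modified copy.
foo1 <- function() {
  d.in <- copy(d.orig)
  d.in[, names(kv) := lapply(names(kv), function(x) {
    gx <- get(x)
    replace(NA_real_[seq_along(gx)], gx, kv[x])
  })]
  d.in[]
}

# `foo1` merely looks up the function object; `foo1()` actually runs it.
res <- microbenchmark(lookup = foo1, call = foo1(), times = 10)
print(res)
```

With the parentheses in place, the same pattern (`foo1()`, `foo2()`, `foo3()` in a single `microbenchmark()` call) should give timings that can be compared directly.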
Answer 0 (score: 4)
The usual approach is to move the data to long format:
DT = melt(d.in, id=c("id","time"))
Then put the mapping in a table, similar to @SymbolixAU's answer:
kDT = data.table(variable = names(kv), value = TRUE, v = unname(kv))
Then do an update join with the mapping, which adds the new column by reference:
DT[kDT, on=.(variable, value), v := i.v ]
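In general, I think that if you care a lot about speed or simple syntax, you want long-format data rather than wide data in R, so I would skip the final dcast step (see @SymbolixAU's answer).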
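Put together on the mini table from the question, the three steps above might look like the sketch below. The construction of `d.in` and `kv` is repeated here only so that the snippet runs on its own.

```r
library(data.table)

kv <- c(drugA = 1, drugB = 2)
d.in <- data.table(
  id    = rep(1:2, each = 4),
  time  = rep(1:4, times = 2),
  drugA = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE),
  drugB = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# 1. Wide to long: one row per (id, time, drug).
DT <- melt(d.in, id.vars = c("id", "time"))

# 2. Mapping table: only rows with value == TRUE should receive a number.
kDT <- data.table(variable = names(kv), value = TRUE, v = unname(kv))

# 3. Update join: rows without a match in kDT keep v = NA.
DT[kDT, on = .(variable, value), v := i.v]
print(DT)
```

Rows where `value` is FALSE have no partner in `kDT`, so their `v` stays `NA` without any explicit clean-up step.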
Answer 1 (score: 3)
I am not sure it is the most efficient way, but you could do
d.in[, names(kv) := lapply(names(kv), function(x) {
gx <- get(x)
replace(NA_real_[seq_along(gx)], gx, kv[x])
})]
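Here we iterate over the names in kv, using get to retrieve each column's values. We then replace the relevant positions of a newly created NA vector with the corresponding kv value, which gives
   id time drugA drugB
1:  1    1     1    NA
2:  1    2     1    NA
3:  1    3    NA    NA
4:  1    4     1    NA
5:  2    1    NA     2
6:  2    2    NA    NA
7:  2    3    NA    NA
8:  2    4    NA    NA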
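To see what the inner function does for a single column, it may help to step through it on a plain vector. In this sketch `gx` is a hypothetical stand-in for the `drugA` values of id 1, and `na_vec` and `out` are names introduced only for illustration.

```r
kv <- c(drugA = 1, drugB = 2)
gx <- c(TRUE, TRUE, FALSE, TRUE)         # stand-in for one logical column

# Indexing the length-1 NA_real_ beyond its length gives a numeric vector of NAs.
na_vec <- NA_real_[seq_along(gx)]

# Overwrite the positions where gx is TRUE with the dictionary value.
out <- replace(na_vec, gx, kv["drugA"])
print(out)  # 1 1 NA 1
```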
Answer 2 (score: 2)
You can create a data.table lookup dictionary, melt the original d.in, then join, update and reshape:
dt_kv <- data.table(drug = c("drugA","drugB"),
value = c(1,2))
d.in <- melt(d.in, id.vars = c("id", "time"))[
dt_kv, on = c(variable = "drug"), nomatch = 0][
value == FALSE, i.value := NA]
dcast(d.in, formula = id + time ~ variable, value.var = "i.value")
# id time drugA drugB
# 1: 1 1 1 NA
# 2: 1 2 1 NA
# 3: 1 3 NA NA
# 4: 1 4 1 NA
# 5: 2 1 NA 2
# 6: 2 2 NA NA
# 7: 2 3 NA NA
# 8: 2 4 NA NA