What is the most efficient way to replace values in a data.table using a dictionary?

Time: 2017-02-04 00:38:40

Tags: r data.table

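I have a very large data.table (1e7 rows by 50 columns). Apart from the key columns, all columns are logical. Here is a mini version.

id  time  drugA  drugB
1   1     TRUE   FALSE
1   2     TRUE   FALSE
1   3     FALSE  FALSE
1   4     TRUE   FALSE
2   1     FALSE  TRUE
2   2     FALSE  FALSE
2   3     FALSE  FALSE
2   4     FALSE  FALSE

I have a key-value 'dictionary' that looks like this:

kv <- c(drugA=1, drugB=2)

I want to use this dictionary to replace the TRUE values in the logical columns with the corresponding values from the dictionary (FALSE becomes NA). The output should look like this.

id  time  drugA  drugB
1   1     1      NA
1   2     1      NA
1   3     NA     NA
1   4     1      NA
2   1     NA     2
2   2     NA     NA
2   3     NA     NA
2   4     NA     NA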
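For anyone who wants to run the answers below, the mini table and the dictionary can be rebuilt roughly as follows. This is a sketch reconstructed from the printed table; the name `d.in` is borrowed from the answers, and the column types are an assumption (integer ids and times, logical drug columns).

```r
library(data.table)

# Mini version of the input table from the question
d.in <- data.table(
  id    = rep(1:2, each = 4),
  time  = rep(1:4, times = 2),
  drugA = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE),
  drugB = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Key-value "dictionary": column name -> replacement value for TRUE
kv <- c(drugA = 1, drugB = 2)

print(d.in)
```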

What is the most efficient (fastest) way?

Update

I have tried the solutions below but could not find a significant difference (though I am not sure my comparison method is valid).

library(data.table)
library(microbenchmark)
d.orig <- data.table(
  id=c(rep(1:2,each=1e7)), 
  time=rep(seq_len(1e7), times=2),  # unique time index within each id
  drugA=sample(c(T,F), 2e7, replace=T),
  drugB=sample(c(T,F), 2e7, replace=T)
  )

# Solution 1
foo1 <- function() {
  d.in <- data.table::copy(d.orig)
  d.in[, names(kv) := lapply(names(kv), function(x) {
    gx <- get(x)
    replace(NA_real_[seq_along(gx)], gx, kv[x])
    })]
}

# Solution 2
dt_kv <- data.table(drug = c("drugA","drugB"), value = c(1,2))
foo2 <- function() {
  d.in <- data.table::copy(d.orig)
  d.in <- melt(d.in, id.vars = c("id", "time"))[ 
    dt_kv, on = c(variable = "drug"), nomatch = 0][
    value == FALSE, i.value := NA]

  dcast(d.in, formula = id + time ~ variable, value.var = "i.value")
}

# Solution 3
kDT = data.table(variable = names(kv), value = TRUE, v = unname(kv))
foo3 <- function() {
  d.in <- data.table::copy(d.orig)
  DT = melt(d.in, id=c("id","time"))
  DT[kDT, on=.(variable, value), v := i.v ]
  dcast(DT, formula = id + time ~ variable, value.var = 'v')
}

which yields (though still with a lot of variation)

print(microbenchmark(foo1, times=1e4))
Unit: nanoseconds
 expr min lq    mean median uq   max neval
 foo1  33 50 85.8657     55 58 56717 10000

print(microbenchmark(foo2, times=1e4))
Unit: nanoseconds
 expr min lq    mean median uq   max neval
 foo2  29 48 70.8304     52 55 57644 10000

print(microbenchmark(foo3, times=1e4))
Unit: nanoseconds
 expr min lq    mean median uq   max neval
 foo3  30 36 61.1542     41 48 58015 10000
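On the benchmark in the update: `microbenchmark(foo1, times=1e4)` passes the bare symbol `foo1`, so what gets timed is most likely only the lookup of the function object, not a run of the function. That would explain why all three timings are a few dozen nanoseconds and indistinguishable. The sketch below contrasts the two forms for Solution 1. It assumes a much smaller table (`n` is a made-up size so the example finishes quickly) and a small `times` value; neither number comes from the question.

```r
library(data.table)
library(microbenchmark)

kv <- c(drugA = 1, drugB = 2)
n  <- 1e4  # assumption: far smaller than the question's data
d.orig <- data.table(
  id    = rep(1:2, each = n),
  time  = rep(seq_len(n), times = 2),
  drugA = sample(c(TRUE, FALSE), 2 * n, replace = TRUE),
  drugB = sample(c(TRUE, FALSE), 2 * n, replace = TRUE)
)

# Solution 1 from the question, unchanged in substance
foo1 <- function() {
  d.in <- copy(d.orig)
  d.in[, names(kv) := lapply(names(kv), function(x) {
    gx <- get(x)
    replace(NA_real_[seq_along(gx)], gx, kv[x])
  })]
}

# `foo1` only evaluates the symbol; `foo1()` actually runs the function
res <- microbenchmark(lookup = foo1, call = foo1(), times = 20)
print(res)
```

To compare the three solutions, the same pattern would apply: `microbenchmark(foo1(), foo2(), foo3(), times = ...)`, with `times` kept small on the full-size data because each call copies and reshapes the whole table.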

3 answers:

Answer 0 (score: 4)

The usual approach is to move the data to long format:

DT = melt(d.in, id=c("id","time"))

Then put the mapping in a table, similar to @SymbolixAU's answer:

kDT = data.table(variable = names(kv), value = TRUE, v = unname(kv))

Then do an update join against the mapping, which adds the new column by reference:

DT[kDT, on=.(variable, value), v := i.v ]

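In general, I think that if you care a lot about speed or simple syntax, you want long-format data rather than wide data in R, so I would skip the final dcast step (the one shown in @SymbolixAU's answer).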
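Put together on the mini table, the three steps above can be sketched as follows. The construction of `d.in` is reconstructed from the question's printed table, and `variable.factor = FALSE` is an addition not present in the answer: it keeps `variable` as character so that it has the same type as the character `variable` column in `kDT`.

```r
library(data.table)

d.in <- data.table(
  id    = rep(1:2, each = 4),
  time  = rep(1:4, times = 2),
  drugA = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE),
  drugB = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)
kv <- c(drugA = 1, drugB = 2)

# 1. Wide -> long: one row per (id, time, drug)
DT <- melt(d.in, id.vars = c("id", "time"), variable.factor = FALSE)

# 2. Mapping table: only rows with value == TRUE will find a match
kDT <- data.table(variable = names(kv), value = TRUE, v = unname(kv))

# 3. Update join: adds column v by reference; unmatched rows stay NA
DT[kDT, on = .(variable, value), v := i.v]

print(DT[!is.na(v)])
```

Because the join matches on both `variable` and `value`, rows where the flag is FALSE never match and keep `NA` in `v`, which is the behaviour the question asks for.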

Answer 1 (score: 3)

Not sure if it is the most efficient way, but you could do

d.in[, names(kv) := lapply(names(kv), function(x) {
        gx <- get(x)
        replace(NA_real_[seq_along(gx)], gx, kv[x])
    })]

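Here we iterate over the names in kv, using get to retrieve each column's values. We then replace the relevant elements of a newly created NA vector with the kv value, which gives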

   id time drugA drugB
1:  1    1     1    NA
2:  1    2     1    NA
3:  1    3    NA    NA
4:  1    4     1    NA
5:  2    1    NA     2
6:  2    2    NA    NA
7:  2    3    NA    NA
8:  2    4    NA    NA
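The core of that expression can be tried on a plain vector, outside of data.table. In this small sketch, `gx` is a stand-in for one logical column (here the first four values of drugA), and the `ifelse` line is an alternative spelling added for comparison, not part of the answer.

```r
gx <- c(TRUE, TRUE, FALSE, TRUE)  # stands in for one logical column
kv <- c(drugA = 1, drugB = 2)

# Indexing a length-1 NA vector past its end yields NA,
# so this builds a numeric NA vector as long as gx
blank <- NA_real_[seq_along(gx)]

# Fill the positions where gx is TRUE with the dictionary value
out <- replace(blank, gx, kv["drugA"])
print(out)  # values: 1, 1, NA, 1

# Alternative spelling with the same values
out2 <- ifelse(gx, kv[["drugA"]], NA_real_)
```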

Answer 2 (score: 2)

You could create a data.table lookup dictionary, melt the original d.in, join, update, and reshape as required

dt_kv <- data.table(drug = c("drugA","drugB"),
                    value = c(1,2))

d.in <- melt(d.in, id.vars = c("id", "time"))[ 
  dt_kv, on = c(variable = "drug"), nomatch = 0][
    value == FALSE, i.value := NA]

dcast(d.in, formula = id + time ~ variable, value.var = "i.value")

#    id time drugA drugB
# 1:  1    1     1    NA
# 2:  1    2     1    NA
# 3:  1    3    NA    NA
# 4:  1    4     1    NA
# 5:  2    1    NA     2
# 6:  2    2    NA    NA
# 7:  2    3    NA    NA
# 8:  2    4    NA    NA