data.table: combining columns (aggregating with a single function)

Asked: 2014-07-30 16:01:43

Tags: r data.table

I have some data similar to the following:

foo <- data.table(uid=c("a","b", "c"), var1=c(T, F, F), var2=c(F, T, F))

I would like to merge (var1, var2) into var3 using the following aggregation rule:

  • w IF var1 == T & var2 == T (otherwise:)
  • x IF var1 == T
  • y IF var2 == T
  • z IF var1 == F & var2 == F

That is, given foo, the expected result is

uid var3
  a    x
  b    y
  c    z

Moreover, any additional column that foo has besides (var1, var2) should be taken over into the new data.table.

4 answers:

Answer 0 (score: 6)

Another way would be:

foo1 <- CJ(var1 = c(T,F), var2 = c(T,F))[, var3 := c('z', 'y', 'x', 'w')]
setkey(foo, var1, var2)

foo[foo1, var3 :=i.var3][order(uid)][,c(1,4), with=F]
#   uid var3
#1:   a    x
#2:   b    y
#3:   c    z
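
The join above relies on `CJ()` returning its rows sorted (FALSE before TRUE), which is why the labels are listed in the order z, y, x, w. Below is a minimal sketch of the same lookup-table idea, assuming an ad hoc `on=` join in place of `setkey()` so that the row order of foo and any extra columns stay untouched. The name `lookup` is only illustrative and does not come from the answer.

```r
library(data.table)

foo <- data.table(uid  = c("a", "b", "c"),
                  var1 = c(TRUE, FALSE, FALSE),
                  var2 = c(FALSE, TRUE, FALSE))

# CJ() sorts its result, so the rows come out as
# (FALSE, FALSE), (FALSE, TRUE), (TRUE, FALSE), (TRUE, TRUE).
lookup <- CJ(var1 = c(TRUE, FALSE), var2 = c(TRUE, FALSE))
lookup[, var3 := c("z", "y", "x", "w")]

# Update join: add var3 to foo by reference, matching on both flags.
foo[lookup, on = .(var1, var2), var3 := i.var3]

# Drop the two source columns; every other column of foo is kept.
foo[, c("var1", "var2") := NULL]
```

Because nothing is re-keyed, no `order(uid)` step or positional column selection is needed afterwards.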

Answer 1 (score: 4)

This may not be optimized either, but I find it easier to read than @David Arenburg's solution.

foo[, `:=` (var3 = ifelse(var1 & var2, "w", ifelse(var1, "x", ifelse(var2, "y", "z"))), 
            var1 = NULL, var2 = NULL)]
foo
#    uid var3
# 1:   a    x
# 2:   b    y
# 3:   c    z
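
As a side note, the same first-match-wins logic can be written without nesting. This is only a sketch, assuming data.table 1.13.0 or later, where `fcase()` is available; it is not part of the original answer.

```r
library(data.table)

foo <- data.table(uid  = c("a", "b", "c"),
                  var1 = c(TRUE, FALSE, FALSE),
                  var2 = c(FALSE, TRUE, FALSE))

# fcase() evaluates the conditions in order and returns the value
# paired with the first condition that is TRUE for each row.
foo[, var3 := fcase(var1 & var2, "w",
                    var1,        "x",
                    var2,        "y",
                    default = "z")]

# Remove the source columns, keeping everything else in foo.
foo[, c("var1", "var2") := NULL]
```

The conditions are checked top to bottom, so the `var1 & var2` case has to come first, exactly as in the nested `ifelse` version.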

Answer 2 (score: 4)

Personally, I don't like ifelse, especially nested ifelses.. :). I think in this case we can do without it, with something like this?

foo[, `:=`(var3 = factor(2*var1+var2, levels=3:0, labels=c("w","x","y","z")), 
           var2 = NULL, var1 = NULL)]
#    uid var3
# 1:   a    x
# 2:   b    y
# 3:   c    z
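
To make the arithmetic explicit: `2*var1 + var2` treats the two flags as a two-bit number, so each of the four combinations gets a distinct code from 0 to 3. The sketch below enumerates all four cases in plain vectors; the names `code`, `via_factor` and `via_index` are illustrative, and the index-based variant is an alternative shown here for comparison rather than something from the answer.

```r
# All four combinations of the two flags.
var1 <- c(TRUE, TRUE, FALSE, FALSE)
var2 <- c(TRUE, FALSE, TRUE, FALSE)

# Logical values are coerced to 0/1, giving the codes 3, 2, 1, 0.
code <- 2 * var1 + var2

# The answer's approach: map codes 3, 2, 1, 0 to the labels w, x, y, z.
via_factor <- as.character(
  factor(code, levels = 3:0, labels = c("w", "x", "y", "z"))
)

# Alternative: use the code (shifted to start at 1) as a vector index.
via_index <- c("z", "y", "x", "w")[code + 1]
```

Both variants give w, x, y, z for the four rows above; the index-based one returns a character vector directly, without the factor round trip.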

Answer 3 (score: 1)

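I did some benchmarking, and it seems that @akrun's solution is the fastest on large datasets. I had to make a few changes (each one is commented in the code) to make the results comparable, but this should not affect performance much.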

# setup of data
require(data.table)
require(microbenchmark)
set.seed(1)
Nsims <- 1e4 # size of dataset
foo <- data.table(uid = 1:Nsims,
                  var1 = sample(c(TRUE, FALSE), Nsims, TRUE), 
                  var2 = sample(c(TRUE, FALSE), Nsims, TRUE))
# benchmark test
microbenchmark(
{ #@shadow
  foo1 <- copy(foo)
  foo1[, `:=` (var3=ifelse(var1&var2, "w", ifelse(var1, "x", ifelse(var2, "y", "z"))), 
               var1=NULL, var2=NULL)]
}
, 
{ #@Arun
  foo2 <- copy(foo)
  foo2[, `:=`(var3 = as.character(factor(2*var1+var2, levels=3:0, labels=c("w","x","y","z"))), 
              # used as.character to give same result as other solutions
              var2 = NULL, var1 = NULL)]
},
{ #@akrun
  foo3 <- copy(foo)
  foo.index <- CJ(var1 = c(T,F), var2 = c(T,F))[, var3 := c('z', 'y', 'x', 'w')]
  setkey(foo3, var1, var2)
  foo3 <- foo3[foo.index, var3 := i.var3][, `:=` (var1=NULL, var2=NULL)][order(uid)]
  # assigned to foo3 to get same result as other solutions and used var1:=NULL, etc to achieve OP's
  # requirement "Moreover, any additional column that foo has besides (var1, var2) should be taken over into the new data.table"
}
)
# (rows are in the order of the expressions above: @shadow, @Arun, @akrun)
#        min        lq   median        uq      max neval
#  19.635460 19.801922 19.93224 20.814533 22.57868   100
#  12.611448 12.762514 12.79219 12.864043 48.10415   100
#   4.691303  4.945683  4.98808  5.084922  7.21636   100
#
# making sure they give the same solutions
all.equal(foo1, foo2)
# [1] TRUE
all.equal(foo1, foo3)
# [1] TRUE