Assigning with `:=` in data.table behaves inconsistently

Time: 2019-01-15 16:36:47

Tags: r data.table

Consider a data.table dt:

library(data.table)
dt  = setDT(structure(list(grp = c("a", "a", "b", "b", "b", "c", "c"),
                     yr = c(2000, 2012, 2004, 2008, 2014, 2008, 2016),
                     sal = c(20000, 240000, 30000,100000,120000, 15000, 60000)), 
.Names = c("grp", "yr", "sal"), 
row.names = c(NA,-7L), class = c("data.table", "data.frame")))

I have a dummy function tag that returns a character value based on some conditions on sal and yr.

tag = function(x){if(x$yr<2010 & x$sal<25000) {return(list(comment="okay"))} 
             else if(x$yr<2010 & x$sal>=25000) {return(list(comment="cool"))} 
             else if(x$yr>=2010 & x$sal<100000){return(list(comment="okay"))} 
             else if(x$yr>=2010 & x$sal>=100000){return(list(comment="cool"))} }

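All values returned by the function are wrapped in a list() call so that the returned value can be assigned to a new column mycomment in the table dt. However, the following two calls behave differently.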

dt[,mycomment:=tag(.SD),by=1:nrow(dt)]
#mycomment appears as a character vector

dt[,`:=`(mycomment=tag(.SD)),by=1:nrow(dt)]
#mycomment appears as a list

Why does the := operator behave differently in this case?

1 Answer:

Answer 0 (score: 3)

The function call for j in x[i, j, ...] when making an assignment to x is

`:=`(col1_name = col1, col2_name = col2)

# or

c("col1_name", "col2_name") := list(col1, col2)

The second way exists for user convenience (so you don't have to mess with backticks around :=). A further convenience is offered when there is a single column:

`:=`(col1_name = col1)

# or 

col1_name := list(col1)

# or 

col1_name := col1

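Here, the final option saves you from having to wrap in list(...). The same convenience feature shows up when by= is present. In both cases, the expectation is that j evaluates to a list of columns, which is why a bare vector is also treated as a length-one list of columns. That convenience is the source of the difference in your example: in mycomment := tag(.SD), the list returned by tag is read as a list of columns, so its single element becomes a character column; in `:=`(mycomment = tag(.SD)), the list is taken as the value of the one column mycomment, which makes it a list column. If you want to avoid reckoning with this inconsistency, you could always write list(...) on the right-hand side or always use the `:=`(...) form in j.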
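To see the two readings side by side, here is a minimal sketch on a toy table. The names demo, as_chr and as_list are made up for illustration and are not part of the question's data.

```r
library(data.table)

# Toy table used only for this illustration.
demo = data.table(id = 1:3)

# `LHS := RHS` form: a list on the right-hand side is read as a list of
# columns, so its single element "x" becomes an ordinary character column.
demo[, as_chr := list("x"), by = id]

# Functional form: each named argument is the value of one column,
# so a list value is stored as a list column.
demo[, `:=`(as_list = list("x")), by = id]

# Inspect the resulting column classes.
sapply(demo, class)
```

The class of as_chr versus as_list should mirror what the question reports for mycomment under the two calls.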

In your example, this might mean changing your function to return a single column instead of wrapping in list(...). For some other ideas and references to the vignettes included with the package, maybe see "Adding list columns to data tables in R returns inconsistent output - feature or bug?"
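As a rough sketch of that suggestion, the function below returns a bare character value rather than a list. The name tag_chr and the columns c1 and c2 are hypothetical, and the condensed threshold logic is my restatement of the four branches in the question, not the asker's code.

```r
library(data.table)

dt = data.table(
  grp = c("a", "a", "b", "b", "b", "c", "c"),
  yr  = c(2000, 2012, 2004, 2008, 2014, 2008, 2016),
  sal = c(20000, 240000, 30000, 100000, 120000, 15000, 60000)
)

# Hypothetical variant of tag(): same rule, but no list() wrapper.
tag_chr = function(x) {
  # The salary threshold depends on the year, as in the original branches.
  limit = if (x$yr < 2010) 25000 else 100000
  if (x$sal < limit) "okay" else "cool"
}

# With a bare vector on the right-hand side, both forms give a character column.
dt[, c1 := tag_chr(.SD), by = 1:nrow(dt)]
dt[, `:=`(c2 = tag_chr(.SD)), by = 1:nrow(dt)]

sapply(dt[, .(c1, c2)], class)
```

Because the function no longer returns a list, there is nothing for either form to interpret as a list of columns or a list column, so the two calls should agree.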

Alternately, you could apply the tag rule more efficiently with something like a "non-equi join":

mDT = data.table(
  yr_up  = c(2010, 2010, Inf, Inf), 
  sal_up = c(25000, Inf, 100000, Inf), 
  value  = c("okay", "cool", "okay", "cool")
)

dt[, cmt := mDT[.SD, on=.(yr_up > yr, sal_up > sal), mult="first"]$value]
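One way to sanity-check the join is to compare it with the same rule written as a vectorized expression. This is only a sketch: the column chk is a name introduced here, and the ifelse() expression is my restatement of the conditions in tag.

```r
library(data.table)

dt = data.table(
  yr  = c(2000, 2012, 2004, 2008, 2014, 2008, 2016),
  sal = c(20000, 240000, 30000, 100000, 120000, 15000, 60000)
)

# Lookup table from the answer: exclusive upper bounds and the label to assign.
mDT = data.table(
  yr_up  = c(2010, 2010, Inf, Inf),
  sal_up = c(25000, Inf, 100000, Inf),
  value  = c("okay", "cool", "okay", "cool")
)

# Non-equi join: for each row of dt, take the first mDT row whose bounds exceed it.
dt[, cmt := mDT[.SD, on = .(yr_up > yr, sal_up > sal), mult = "first"]$value]

# The same rule, vectorized, for comparison (chk is a hypothetical column name).
dt[, chk := ifelse(sal < ifelse(yr < 2010, 25000, 100000), "okay", "cool")]

dt[, all(cmt == chk)]
```

Note that the join depends on the row order of mDT, since mult="first" picks the first matching row; reordering the lookup table would change the labels.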