By reference

Time: 2016-04-05 18:52:04

Tags: r data.table

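I am working with JSON data that I parse (using jsonlite::fromJSON) into a nested data.frame, which I then convert recursively to data.table using setDT. The problem is that in order to "explode" along any column of nested data.table elements (e.g. dt[, nested_dt[[1]], by=.(a, b, c)], see the accepted answer here) it is necessary to (1) make sure that all the nested data.tables have the same columns and (2) make sure that those columns have the same class.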
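For readers unfamiliar with the "explode" pattern, here is a minimal sketch of it on a small table whose nested elements already share columns and classes. The names `parent`, `t1`, `t2` and `exploded` are made up for illustration and are not part of the original data.

```r
library(data.table)

# Two nested tables with identical column names and classes (illustrative data).
t1 <- data.table(d = c("a", "b"), e = c(100, 200))
t2 <- data.table(d = c("c", "d"), e = c(300, 400))

# Parent table with one list column holding the nested tables.
parent <- data.table(a = c(1, 2))
parent[, b := list(list(t1, t2))]

# "Explode": each group (one row per value of a) returns its nested table,
# and data.table stacks the per-group results.
exploded <- parent[, b[[1]], by = a]
```

The stacking step is where mismatched column sets or classes between groups become a problem, which is the motivation for harmonizing the nested tables first.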

The trouble is that R (or data.table, I am not sure which one is at fault) triggers a shallow copy when a new column is added to a nested data.table.

I would like to do something like this (with real logic around the names and types of the added columns):

add_col1 <- function(dt) {
  if (is.data.table(dt)) 
    dt[, new_col:=NA]

  if (is.list(dt)) 
    lapply(dt, add_col1)

  return(invisible())
}

However, testing it yields:

dt <- data.table(a=c(1,2), b=list(data.table(d=c("a", "b"), e=c(100, 200))))
dt
#    a            b 
# 1: 1 <data.table>     
# 2: 2 <data.table> 

add_col1(dt)
# Warning messages:
# 1: In `[.data.table`(dt, , `:=`(new_col, NA)) :
#    Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table 
#    so that := can add this new column by reference. At an earlier point, this data.table 
#    has been copied by R (or been created manually using structure() or similar). Avoid 
#    key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. 
#    Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, 
#    list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please 
#    upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to 
#    datatable-help so the root cause can be fixed.

dt
#    a            b new_col
# 1: 1 <data.table>      NA
# 2: 2 <data.table>      NA

dt[, b]
# [[1]]
#    d   e
# 1: a 100
# 2: b 200
# 
# [[2]]
#    d   e
# 1: a 100
# 2: b 200

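So I trigger an unwanted copy and do not get the desired result (new_col was added to the top-level data.table, which is good, but not to the nested data.tables, which is bad). Since I thought the problem was that lapply does not assign back to the original parent data.table, I tried: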

add_col2 <- function(dt) {
  if (is.data.table(dt)) {
    dt[, new_col:=NA]

    id <- unlist(lapply(dt, is.list))
    for (col in colnames(dt)[id])
      dt[, c(col):=add_col2(get(col))]
  } else if (is.list(dt)) 
    return(invisible(lapply(dt, add_col2)))

  return(invisible(dt))
}

As shown below, this produces the desired output, but it does not avoid the shallow copy (or the warning message that comes with it).

dt <- data.table(a=c(1,2), b=list(data.table(d=c("a", "b"), e=c(100, 200))))
dt
#    a            b 
# 1: 1 <data.table>     
# 2: 2 <data.table> 

add_col2(dt)
# Warning messages:
# 1: In `[.data.table`(dt, , `:=`(new_col, NA)) :
#    Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table 
#    so that := can add this new column by reference. At an earlier point, this data.table 
#    has been copied by R (or been created manually using structure() or similar). Avoid 
#    key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. 
#    Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, 
#    list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please 
#    upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to 
#    datatable-help so the root cause can be fixed.

dt
#    a            b new_col
# 1: 1 <data.table>      NA
# 2: 2 <data.table>      NA

dt[, b]
# [[1]]
#    d   e new_col
# 1: a 100      NA
# 2: b 200      NA
# 
# [[2]]
#    d   e new_col
# 1: a 100      NA
# 2: b 200      NA
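If an explicit copy is acceptable, one way to sidestep the warning is to copy each nested table with copy(), add the missing columns with set(), and then replace the whole list column in the parent. The sketch below assumes that trade-off; `add_missing_cols`, `template` and `parent` are hypothetical names, and only missing columns are handled (aligning classes would need an analogous step). It does not modify the nested tables by reference.

```r
library(data.table)

# Hypothetical helper: return a copy of `x` with every column named in
# `template` that is missing from `x` added as a typed NA column.
add_missing_cols <- function(x, template) {
  x <- copy(x)  # explicit deep copy; the copy carries a valid self-reference
  for (col in setdiff(names(template), names(x))) {
    # Indexing a zero-length prototype with NA yields an NA of the same type.
    set(x, j = col, value = template[[col]][NA_integer_])
  }
  x
}

# Illustrative data: the second nested table has an extra column f.
ndt1 <- data.table(d = c(1.2, 1.4), e = c("a1", "b1"))
ndt2 <- data.table(d = c(1, 2), e = c("a2", "b2"), f = c(1, 2))
parent <- data.table(a = c("abc", "def"))
parent[, b := list(list(ndt1, ndt2))]

# Assumed target schema, written as zero-length prototypes.
template <- list(d = numeric(), e = character(), f = numeric())

# Rebuild the list column and assign it back to the parent in one step.
harmonized <- lapply(parent$b, add_missing_cols, template = template)
set(parent, j = "b", value = list(harmonized))

exploded <- parent[, b[[1]], by = a]
```

The originals `ndt1` and `ndt2` are left untouched, so this costs one copy per nested table.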

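Is there a right way to do this? I could suppress the warning and use the add_col2 pattern above, but it would be nice if there were a way to modify the nested data without copying. I am also aware of the possibility of using rbindlist with fill=TRUE, but because my use case involves a by= argument, I would rather avoid that approach.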
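For reference, the rbindlist alternative mentioned above could look roughly like the following sketch. It assumes the nested tables live in a list column `b` of a table called `parent` (illustrative names), and uses the idcol argument to record which parent row each nested row came from so that parent columns can be carried over.

```r
library(data.table)

# Illustrative nested tables; the second one has an extra column f.
ndt1 <- data.table(d = c(1.2, 1.4), e = c("a1", "b1"))
ndt2 <- data.table(d = c(1, 2), e = c("a2", "b2"), f = c(1, 2))
parent <- data.table(a = c("abc", "def"))
parent[, b := list(list(ndt1, ndt2))]

# Stack the nested tables; fill = TRUE pads missing columns with NA and
# idcol records the position of each nested table in the list.
stacked <- rbindlist(parent$b, fill = TRUE, idcol = "row_id")

# Carry a parent column over by indexing with the recorded position.
stacked[, a := parent$a[row_id]]
```

Because `stacked` is a fresh data.table returned by rbindlist, adding `a` with := here does not touch the nested tables at all.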

These questions helped my understanding but did not solve my problem:
Adding new columns to a data.table by-reference within a function not always working
Using setDT inside a function

EDIT ------------------------

Avoiding lapply does not seem to help. The following gives exactly the same result as add_col2.

add_col3 <- function(dt) {
  if (is.data.table(dt)) {
    dt[, new_col:=NA]
    id <- unlist(lapply(dt, is.list))
    for (col in colnames(dt)[id]) {
      for (i in seq(1, dt[, .N]))
        dt[i, c(col):=.(list(add_col3(get(col)[[1]])))]
    }
  } else if (is.list(dt)) 
    stop("should not reach this now")

  return(invisible(dt))
}

EDIT 2 -------------------------

Following Eddi's comment below, I got the desired result from add_col1 by adding a setDF / setDT step:

dt <- data.table(a=c(1,2), b=list(data.table(d=c("a", "b"), e=c(100, 200))))

# here is the addition
lapply(dt$b, setDF)
lapply(dt$b, setDT)

dt
#    a            b
# 1: 1 <data.table>
# 2: 2 <data.table>

add_col1(dt)
dt
#    a            b new_col
# 1: 1 <data.table>      NA
# 2: 2 <data.table>      NA

dt[, b]
# [[1]]
#    d   e new_col
# 1: a 100      NA
# 2: b 200      NA
# 
# [[2]]
#    d   e new_col
# 1: a 100      NA
# 2: b 200      NA

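I do not understand why this step works. It does not seem to be because the original dt was formed by recycling the nested data.table: I got the same result using
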
dt <- data.table(a=c("abc", "def", "ghi"))
ndt1 <- data.table(d=c(1.2, 1.4), e=c("a1", "b1"))
ndt2 <- data.table(d=c(1L, 2L), e=c("a2", "b2"), f=c(1, 2))
ndt3 <- data.table(d=c(1.6, 3.4), e=c("a3", "b3"))
dt[, b:=c(list(ndt1),
          list(ndt2),
          list(ndt3))]
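
The three nested tables in this second example differ both in their columns (only one has f) and in the class of d. A quick, read-only way to see such differences before harmonizing could be sketched as follows; `column_classes`, `nested` and `class_summary` are hypothetical names, and the data simply mirrors the example above.

```r
library(data.table)

# Hypothetical helper: one row per (nested table, column) with the first
# class of that column. Nothing is modified.
column_classes <- function(tables) {
  rbindlist(lapply(seq_along(tables), function(i) {
    x <- tables[[i]]
    data.table(
      table_id  = i,
      column    = names(x),
      col_class = unname(vapply(x, function(col) class(col)[1L], character(1)))
    )
  }))
}

nested <- list(
  data.table(d = c(1.2, 1.4), e = c("a1", "b1")),
  data.table(d = c(1L, 2L), e = c("a2", "b2"), f = c(1, 2)),
  data.table(d = c(1.6, 3.4), e = c("a3", "b3"))
)

class_summary <- column_classes(nested)

# Columns that appear with more than one class, or in only some tables,
# are the ones that need harmonizing.
per_column <- class_summary[, .(n_classes = uniqueN(col_class), n_tables = .N),
                            by = column]
```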

0 answers:

No answers