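I am working with JSON data that I parse (using jsonlite::fromJSON) into nested data.frames, which I then recursively convert to data.tables with setDT. The problem is that to "explode" any of the nested data.table columns along the parent rows (e.g., dt[, nested_dt[[1]], by=.(a, b, c)], see the accepted answer here), it is necessary to (1) ensure that all nested data.tables have the same columns and (2) ensure that those columns have the same class.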
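To make the two requirements concrete, here is a minimal sketch, assuming a small toy table rather than real JSON output. The helper same_schema is a hypothetical name introduced only for illustration; it is not part of data.table. It checks that every nested table has the same column names and first class before the nested column is expanded with by=.

```r
library(data.table)

# Hypothetical helper (not part of data.table): TRUE when every table in the
# list has the same column names, in the same order, with the same first class.
same_schema <- function(tables) {
  schemas <- lapply(tables, function(x) {
    vapply(x, function(col) class(col)[1L], character(1))
  })
  all(vapply(schemas, identical, logical(1), schemas[[1L]]))
}

# Toy data (an assumption for illustration): one nested data.table per row.
dt <- data.table(
  a = c(1, 2),
  b = list(
    data.table(d = c("a", "b"), e = c(100, 200)),
    data.table(d = c("c", "d"), e = c(300, 400))
  )
)

# Each group is a single parent row, so b[[1]] is that row's nested table.
if (same_schema(dt$b)) {
  exploded <- dt[, b[[1]], by = .(a)]
}
```

If the check fails, the nested tables would need to be aligned first, which is where the column-adding logic in this question comes in.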
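The trouble is that R (or data.table, I'm not sure which) seems to have a problem with adding new columns to a nested data.table: doing so triggers a shallow copy.
I would like to do something like this (with actual logic around the names and types of the added columns):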
add_col1 <- function(dt) {
  if (is.data.table(dt))
    dt[, new_col:=NA]
  if (is.list(dt))
    lapply(dt, add_col1)
  return(invisible())
}
However, testing this yields:
dt <- data.table(a=c(1,2), b=list(data.table(d=c("a", "b"), e=c(100, 200))))
dt
# a b
# 1: 1 <data.table>
# 2: 2 <data.table>
add_col1(dt)
# Warning messages:
# 1: In `[.data.table`(dt, , `:=`(new_col, NA)) :
# Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table
# so that := can add this new column by reference. At an earlier point, this data.table
# has been copied by R (or been created manually using structure() or similar). Avoid
# key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table.
# Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2,
# list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please
# upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to
# datatable-help so the root cause can be fixed.
dt
# a b new_col
# 1: 1 <data.table> NA
# 2: 2 <data.table> NA
dt[, b]
# [[1]]
# d e
# 1: a 100
# 2: b 200
#
# [[2]]
# d e
# 1: a 100
# 2: b 200
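So I trigger a shallow copy and do not get the desired result (new_col is added to the top-level data.table, which is good, but not to the nested data.tables, which is bad). Since I thought the problem was that lapply does not assign back to the original parent data.table, I tried: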
add_col2 <- function(dt) {
  if (is.data.table(dt)) {
    dt[, new_col:=NA]
    id <- unlist(lapply(dt, is.list))
    for (col in colnames(dt)[id])
      dt[, c(col):=add_col2(get(col))]
  } else if (is.list(dt))
    return(invisible(lapply(dt, add_col2)))
  return(invisible(dt))
}
As shown below, this produces the desired output, but it does not avoid the shallow copy (or the warning message that comes with it).
dt <- data.table(a=c(1,2), b=list(data.table(d=c("a", "b"), e=c(100, 200))))
dt
# a b
# 1: 1 <data.table>
# 2: 2 <data.table>
add_col2(dt)
# Warning messages:
# 1: In `[.data.table`(dt, , `:=`(new_col, NA)) :
# Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table
# so that := can add this new column by reference. At an earlier point, this data.table
# has been copied by R (or been created manually using structure() or similar). Avoid
# key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table.
# Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2,
# list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please
# upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to
# datatable-help so the root cause can be fixed.
dt
# a b new_col
# 1: 1 <data.table> NA
# 2: 2 <data.table> NA
dt[, b]
# [[1]]
# d e new_col
# 1: a 100 NA
# 2: b 200 NA
#
# [[2]]
# d e new_col
# 1: a 100 NA
# 2: b 200 NA
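Is there a right way to do this? I could suppress the warnings and use the add_col2 pattern above, but it would be nice if there were a way to modify the nested data without copying. I am also aware of the possibility of using rbindlist with fill=TRUE, but since my use case involves the by= argument, I would rather avoid that approach.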
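For reference, the rbindlist alternative mentioned above could look roughly like the sketch below. The tables and the row_id column name are illustrative assumptions; fill=TRUE pads missing columns with NA and idcol records which nested table each row came from, so the parent columns can be attached by position afterwards. Whether this fits alongside a by= workflow depends on the use case.

```r
library(data.table)

# Illustrative nested tables with differing columns.
ndt1 <- data.table(d = c(1.2, 1.4), e = c("a1", "b1"))
ndt2 <- data.table(d = c(1L, 2L), e = c("a2", "b2"), f = c(1, 2))
ndt3 <- data.table(d = c(1.6, 3.4), e = c("a3", "b3"))
dt <- data.table(a = c("abc", "def", "ghi"), b = list(ndt1, ndt2, ndt3))

# Stack the nested tables; columns missing from a table are filled with NA.
# For an unnamed list, the id column holds each table's position in the list.
stacked <- rbindlist(dt$b, fill = TRUE, idcol = "row_id")

# Attach the parent column by looking up the originating row.
stacked[, a := dt$a[row_id]]
```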
These questions helped my understanding but did not solve my problem:
Adding new columns to a data.table by-reference within a function not always working
Using setDT inside a function
EDIT ------------------------
Avoiding lapply does not seem to help. The following gives exactly the same result as add_col2.
add_col3 <- function(dt) {
  if (is.data.table(dt)) {
    dt[, new_col:=NA]
    id <- unlist(lapply(dt, is.list))
    for (col in colnames(dt)[id]) {
      for (i in seq(1, dt[, .N]))
        dt[i, c(col):=.(list(add_col3(get(col)[[1]])))]
    }
  } else if (is.list(dt))
    stop("should not reach this now")
  return(invisible(dt))
}
EDIT 2 -------------------------
Following Eddi's comment below, I got the desired result with add_col1 by adding a setDF / setDT step:
dt <- data.table(a=c(1,2), b=list(data.table(d=c("a", "b"), e=c(100, 200))))
# here is the addition
lapply(dt$b, setDF)
lapply(dt$b, setDT)
dt
# a b
# 1: 1 <data.table>
# 2: 2 <data.table>
add_col1(dt)
dt
# a b new_col
# 1: 1 <data.table> NA
# 2: 2 <data.table> NA
dt[, b]
# [[1]]
# d e new_col
# 1: a 100 NA
# 2: b 200 NA
#
# [[2]]
# d e new_col
# 1: a 100 NA
# 2: b 200 NA
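I do not understand why this step works. It does not seem to be because the original dt was formed by recycling the nested data.table. I also tested with a dt built without recycling: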
dt <- data.table(a=c("abc", "def", "ghi"))
ndt1 <- data.table(d=c(1.2, 1.4), e=c("a1", "b1"))
ndt2 <- data.table(d=c(1L, 2L), e=c("a2", "b2"), f=c(1, 2))
ndt3 <- data.table(d=c(1.6, 3.4), e=c("a3", "b3"))
dt[, b:=c(list(ndt1),
          list(ndt2),
          list(ndt3))]