I need to assign a "second" ID that groups some of the values in the original id column. Here is my sample data:
dt<-structure(list(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
period = c("start", "end", "start", "end", "start", "end"),
date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date"))),
class = c("data.table", "data.frame"),
.Names = c("id", "period", "date"),
sorted = "id")
> dt
     id period       date
1: aaaa  start 2012-03-02
2: aaaa    end 2012-03-02
3: aaas  start 2012-08-29
4: aaas    end 2013-02-26
5: bbbb  start 2012-03-31
6: bbbb    end 2013-02-11
The id column needs to be grouped according to this list (members of the same group get the same value in id2):
> groups
[[1]]
[1] "aaaa" "aaas"
[[2]]
[1] "bbbb"
I used the following code, which seems to work but gives a warning:
> dt[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]
Warning message:
In `[.data.table`(dt, , `:=`(id2, which(vapply(groups, function(x, :
Invalid .internal.selfref detected and fixed by taking a copy of the whole table,
so that := can add this new column by reference. At an earlier point, this data.table has
been copied by R (or been created manually using structure() or similar). Avoid key<-,
names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use
set* syntax instead to avoid copying: setkey(), setnames() and setattr(). Also,
list (DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects),
use reflist() instead if needed (to be implemented). If this message doesn't help,
please report to datatable-help so the root cause can be fixed.
> dt
     id period       date id2
1: aaaa  start 2012-03-02   1
2: aaaa    end 2012-03-02   1
3: aaas  start 2012-08-29   1
4: aaas    end 2013-02-26   1
5: bbbb  start 2012-03-31   2
6: bbbb    end 2013-02-11   2
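Can someone briefly explain the nature of this warning and what implications, if any, it has for the final result? Thanks.
EDIT:
The following code shows how dt is actually created and how it is passed to the function that gives the warning: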
f.main <- function(){
  f2 <- function(x){
    groups <- list(c("aaaa", "aaas"), "bbbb") # actually generated depending on the similarity between values of x$id
    x <- x[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]
    return(x)
  }
  x <- f1()
  if(!is.null(x[["res"]])){
    x <- f2(x[["res"]])
    return(x)
  } else {
    # something else
  }
}
f1 <- function(){
  dt<-data.table(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                 period = c("start", "end", "start", "end", "start", "end"),
                 date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date")))
  return(list(res=dt, other_results=""))
}
> f.main()
     id period       date id2
1: aaaa  start 2012-03-02   1
2: aaaa    end 2012-03-02   1
3: aaas  start 2012-08-29   1
4: aaas    end 2013-02-26   1
5: bbbb  start 2012-03-31   2
6: bbbb    end 2013-02-11   2
Warning message:
In `[.data.table`(x, , `:=`(id2, which(vapply(groups, function(x, :
Invalid .internal.selfref detected and fixed by taking a copy of the whole table,
so that := can add this new column by reference. At an earlier point, this data.table
has been copied by R (or been created manually using structure() or similar).
Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole
data.table. Use set* syntax instead to avoid copying: setkey(), setnames() and setattr().
Also, list(DT1,DT2) will copy the entire DT1 and DT2 (R's list() copies named objects),
use reflist() instead if needed (to be implemented). If this message doesn't help,
please report to datatable-help so the root cause can be fixed.
Answer 0 (score: 11)
Yes, the problem is the list. Here is a simple example:
DT <- data.table(1:5)
mylist1 <- list(DT,"a")
mylist1[[1]][,id:=.I]
#warning
mylist2 <- list(data.table(1:5),"a")
mylist2[[1]][,id:=.I]
#no warning
You should avoid copying a data.table into a list (to be safe, I would avoid putting a DT in a list at all). Try this:
f1 <- function(){
  mylist <- list(res=data.table(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                                period = c("start", "end", "start", "end", "start", "end"),
                                date = structure(c(15401L, 15401L, 15581L, 15762L, 15430L, 15747L), class = c("IDate", "Date"))))
  other_results <- ""
  mylist$other_results <- other_results
  mylist
}
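If restructuring f1() is not an option, another possible workaround is to make the copy explicit inside the consuming function. The sketch below is not the approach from this answer: it assumes data.table is installed, and add_id2 is a hypothetical helper name. It relies on copy() returning a fresh, over-allocated data.table, so := should be able to add the column by reference without the warning, at the cost of one deliberate copy.

```r
library(data.table)

# Hypothetical helper (name is an assumption): add a group index column
# on an explicit copy, leaving the caller's table untouched.
add_id2 <- function(x, groups) {
  x <- copy(x)  # deliberate deep copy; the result is a fresh data.table
  x[, id2 := which(vapply(groups, function(g, y) any(g == y),
                          .BY[[1]], FUN.VALUE = TRUE)),
    by = id]
  x[]
}

# Usage with the data from the question
dt <- data.table(id = c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb"),
                 period = c("start", "end", "start", "end", "start", "end"))
groups <- list(c("aaaa", "aaas"), "bbbb")
res <- add_id2(dt, groups)
```

Because the function works on a copy, dt itself does not gain the id2 column; only the returned table does.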
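Answer 1 (score: 10)
You can make a "shallow copy" when creating the list, so that 1) you don't make a full in-memory copy (speed is unaffected) and 2) you don't get the internal reference error (thanks to @mnel for this trick).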
set.seed(45)
ss <- function() {
tt <- sample(1:10, 1e6, replace=TRUE)
}
tt <- replicate(100, ss(), simplify=FALSE)
tt <- as.data.table(tt)
system.time( {
ll <- list(d1 = { # shallow copy here...
data.table:::settruelength(tt, 0)
invisible(alloc.col(tt))
}, "a")
})
user system elapsed
0 0 0
> system.time(tt[, bla := 2])
user system elapsed
0.012 0.000 0.013
> system.time(ll[[1]][, bla :=2 ])
user system elapsed
0.008 0.000 0.010
So you don't compromise on speed, and you avoid the warning and the full copy that follows it. Hope this helps.
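Answer 2 (score: 6)
"Invalid .internal.selfref detected and fixed by taking a copy ..."
There is no need for a copy when assigning id2 inside f2(); you can add the column directly by changing: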
# From:
x <- x[, id2 := which(vapply(groups, function(x,y) any(x==y), .BY[[1]], FUN.VALUE=T)), by=id]
# To something along the lines of:
x$id2 <- findInterval( match( x$id, unlist(groups)), cumsum(c(0,sapply(groups, length)))+1)
You can then continue to use the data.table 'x' without triggering the warning.
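To see why the findInterval() expression yields the group index, it can help to walk through the intermediate values. The sketch below uses only base R and the ids and groups from the question; the variable names pos, breaks and id2_alt are introduced here for illustration, and the rep() lookup at the end is an equivalent alternative rather than part of the answer.

```r
groups <- list(c("aaaa", "aaas"), "bbbb")
ids <- c("aaaa", "aaaa", "aaas", "aaas", "bbbb", "bbbb")

# Position of each id in the flattened group list c("aaaa", "aaas", "bbbb")
pos <- match(ids, unlist(groups))            # 1 1 2 2 3 3

# First flattened position of each group, plus one closing break
breaks <- cumsum(c(0, lengths(groups))) + 1  # 1 3 4

# The interval each position falls into is the index of its group
id2 <- findInterval(pos, breaks)             # 1 1 1 1 2 2

# Equivalent lookup: label every flattened position with its group index
id2_alt <- rep(seq_along(groups), lengths(groups))[pos]
```

Both versions are vectorised over the whole id column, so they avoid the per-group vapply() call.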
Also, to simply suppress the warning, you can wrap the f2(x[["res"]]) call in suppressWarnings().
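Note that suppressWarnings() hides every warning raised by the wrapped call, not only this one. A more targeted variant can be sketched in base R with withCallingHandlers(); quiet_selfref is a hypothetical helper name, and matching on the ".internal.selfref" text of the message is an assumption about how to recognise this warning. Like suppressWarnings(), it only silences the message and does not avoid the copy.

```r
# Hypothetical helper (name is an assumption): muffle only warnings whose
# message mentions ".internal.selfref"; all other warnings pass through.
quiet_selfref <- function(expr) {
  withCallingHandlers(
    expr,
    warning = function(w) {
      if (grepl(".internal.selfref", conditionMessage(w), fixed = TRUE)) {
        invokeRestart("muffleWarning")
      }
    }
  )
}

# Usage sketch, mirroring the call in f.main():
# x <- quiet_selfref(f2(x[["res"]]))
```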
Even on a small table, the performance difference is substantial:
Performance Comparison:
Unit: milliseconds
                       expr      min       lq   median       uq      max neval
                   f.main() 2.896716 2.982045 3.034334 3.137628 7.542367   100
 suppressWarnings(f.main()) 3.005142 3.081811 3.133137 3.210126 5.363575   100
            f.main.direct() 1.279303 1.384521 1.413713 1.486853 5.684363   100