Creating a data.table by reference to existing vectors

Asked: 2015-07-29 23:12:17

Tags: r memory data.table

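I would like to create a data.table from large pre-existing vectors without copying those vectors. That is, I want a data.table that holds only a shallow copy (pointers to the underlying vectors) rather than a full copy of the data in the vectors.

I assume this is a common wish, but I have not found a way to do it. Once the vectors are columns in another data.table, I can cheaply make further copies, but I have not seen any description of how to create that initial table by reference.

Is this possible? Here is my attempt with a single vector, although my real goal is to create a data.table from several large vectors: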

nate@ubuntu:~/R/byreference$ cat dt.R

library(data.table)

# Some large vector that needs to be created anyway
largeVector = rnorm(1000*1000)

# I'd like to see no large memory allocations
Rprofmem("allocations.txt")

# I want to create dt without copying largeVector
dt = as.data.table(list(x = largeVector))

# This variation doesn't work either:
# dt = data.table(list(x = largeVector))

# This one comes closest to working, but acts as copy-on-write
# dt = setDT(list(x = largeVector))

# Currently, I see lots and lots of memory allocations
# (some may be https://github.com/Rdatatable/data.table/issues/1062)
Rprofmem(NULL)

# The addresses of the vectors should be identical if no copy occurred
identical(address(largeVector), address(dt$x))  # FIXME: should be TRUE

# For comparison, the addresses are identical if I copy 'dt'
dtCopy = dt
identical(address(dtCopy$x), address(dt$x))

# I'm not looking for copy-on-write semantics.  I'd like a simple
# reference, the same as would occur with a shallow copy of a data.table
dt[, x := 2.0*x]

# But this works! (see second edit at bottom)
# dt[1:.N, x := 2.0*x]    

# All of these should be true (currently only the last two are)
identical(dt$x, largeVector)                   # FIXME: should be TRUE
identical(address(largeVector), address(dt$x)) # FIXME: should be TRUE
identical(dt$x, dtCopy$x)
identical(address(dtCopy$x), address(dt$x))

This is what I see with R 3.1.2 and data.table 1.9.4:

nate@ubuntu:~/R/byreference$ Rscript dt.R
[1] FALSE
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE
[1] TRUE

nate@ubuntu:~/R/byreference$ cat allocations.txt
1480 :"as.data.table"
6320 :"as.data.table"
6320 :"as.data.table"
1064 :"as.data.table"
344 :"as.data.table"
928 :"as.data.table"
1808 :"as.data.table"
600 :"as.data.table"
192 :"as.data.table"
408 :"as.data.table.list" "as.data.table"
1256 :"as.data.table.list" "as.data.table"
1248 :"as.data.table.list" "as.data.table"
1064 :"as.data.table.list" "as.data.table"
240 :"as.data.table.list" "as.data.table"
432 :"as.data.table.list" "as.data.table"
184 :"as.data.table.list" "as.data.table"
8000040 :"copy" "as.data.table.list" "as.data.table"
216 :"copy" "as.data.table.list" "as.data.table"
440 :"copy" "as.data.table.list" "as.data.table"
440 :"copy" "as.data.table.list" "as.data.table"
1064 :"copy" "as.data.table.list" "as.data.table"
536 :"as.data.table.list" "as.data.table"
1816 :"as.data.table.list" "as.data.table"
1808 :"as.data.table.list" "as.data.table"
1064 :"as.data.table.list" "as.data.table"
384 :"as.data.table.list" "as.data.table"
720 :"as.data.table.list" "as.data.table"
256 :"as.data.table.list" "as.data.table"
1024 :"as.data.table.list" "as.data.table"
4016 :"as.data.table.list" "as.data.table"
4016 :"as.data.table.list" "as.data.table"
1064 :"as.data.table.list" "as.data.table"
208 :"as.data.table.list" "as.data.table"
656 :"as.data.table.list" "as.data.table"
1264 :"as.data.table.list" "as.data.table"
416 :"as.data.table.list" "as.data.table"
184 :"eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
336 :"eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
336 :"eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
1064 :"eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
304 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
872 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
872 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
1064 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
208 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
368 :"get" "dim" "ncol" "eval" "eval" "alloc.col" "as.data.table.list" "as.data.table"
840 :"alloc.col" "as.data.table.list" "as.data.table"
840 :"alloc.col" "as.data.table.list" "as.data.table"
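
Wow, building 'dt' involves far more allocations than I expected! Most of them are small, but I would really like to avoid the large one (the 8000040-byte allocation under "copy"), since my vectors may each be several GB.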

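Edit: Eddi initially marked this as a duplicate of "Sub-assign by reference on vector in R". It is not. My goal is not to modify the vector; my goal is to create a data.table from the vector without copying it. I only used modification because most readers will not have R compiled in a way that enables Rprofmem, and checking for the side effect is one way to confirm that no copy occurred. I have changed the example to try to make this clearer.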

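Edit: That said, I think Eddi is right that my problem is actually due to the bug he just filed (update: now fixed): https://github.com/Rdatatable/data.table/issues/1248. The combination of "dt = setDT(list(x = largeVector))" followed by "dt[1:.N, x := 2.0*x]" works as I expected: the modification happens in place, with no large allocation. So although I do not think this is actually a duplicate, it may be fine to let this question go away.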
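
For readers who want to check this on their own installation, the combination described in the edit above can be sketched as follows. This is only an illustrative check, not a statement of what any particular data.table version does: the names big_vec and expected are made up for this example, and the script prints what it observes rather than assuming the result, since behaviour may differ between versions.

```r
library(data.table)

big_vec  <- rnorm(1000)      # hypothetical stand-in for a large vector
expected <- 2.0 * big_vec    # separately allocated doubled values, for comparison

# Wrap the vector in a data.table via setDT(), as in the edit above
dt <- setDT(list(x = big_vec))
same_before <- identical(address(big_vec), address(dt$x))

# Sub-assignment with an explicit row index, as in the edit above
dt[1:.N, x := 2.0 * x]
same_after <- identical(address(big_vec), address(dt$x))

# Report what was observed on this installation
cat("column shares memory with big_vec before update:", same_before, "\n")
cat("column shares memory with big_vec after update: ", same_after, "\n")
cat("big_vec itself now holds the doubled values:    ",
    isTRUE(all.equal(big_vec, expected)), "\n")
```

If the last line reports TRUE, the update was applied to the original vector's memory, which is the side effect the question uses as evidence that no copy was made.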

1 answer:

Answer 0 (score: 2)

The open data.table issue #1248 [update: now resolved] notwithstanding, the way to convert a set of vectors into a data.table without copying the data is:

a = 1:5
b = 5:1
address(a)
#[1] "000000000FFE6AE0"
address(b)
#[1] "000000000FFE6A50"

dt = setDT(list(a, b))
sapply(dt, address)
#                V1                 V2 
#"000000000FFE6AE0" "000000000FFE6A50"
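
The same check can be written so that it does not depend on reading addresses by eye. The sketch below is a minimal variation on the code above, assuming named list elements (so the columns get names instead of V1 and V2); the helper variables before and after are illustrative only.

```r
library(data.table)

a <- rnorm(1e5)
b <- runif(1e5)

# Record where the vectors live before building the table
before <- c(a = address(a), b = address(b))

# setDT() on a named list converts the list itself into a data.table
dt <- setDT(list(a = a, b = b))

# Addresses of the columns inside the table
after <- vapply(dt, address, character(1))

# TRUE for a column means it still points at the original vector
print(before == after)
```

Any FALSE in the printed result would mean that the corresponding column was copied during construction.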