Question

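I have two real data.table code examples that work correctly but seem to consume more memory than I expected. I would really appreciate ideas on how to make this code more memory efficient.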

A = data.table(a=c(rep(1,5),rep(2,5)), b1=2:11, b2=22:31, c=c(1,2,1,2,1,2,1,2,1,2))

# Example 1:
# Pick the column name (b1 or b2) based on the value in column a 
# and assign the value from <b1 or b2> by reference to column res
setkey(A, a, c)
A[, res:=get(paste0("b", a)), by=c("a", "c")] 

# Example 2:
# Group the values of A by key, saving the following: 
# 1) number of values in column res that meet some condition
# 2) the minimum value of column res
setkey(A, c)
z = A[, list(length(.I[res>5]), min(res)), by=c]

I profiled both with lineprof on larger, real data, and both stand out as outliers among otherwise efficient data.table code.

# This is more like the real size of the data I'm dealing with
A = data.table(a=c(rep(1,5e6),rep(2,5e6)), 
               b1=1:5e6, b2=(5e6+1):10e6, 
               c=round(runif(1e7, min=1, max=2)))

Any suggestions would be greatly appreciated!

Answer 1

If Example 1 uses by = list(a, c) only to step through the data group by group so that get() works, then

setkey(A,a)
A[.(1), res := b1]
A[.(2), res:= b2]

should be more efficient.
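
To check on the small table from the question that the keyed assignment gives the same res as the grouped get() call, both can be run side by side and compared. This is only a sketch; the names A1, A2 and same_res are made up for illustration, and it says nothing about memory use by itself.

```r
library(data.table)

# Small table from the question, copied so each approach gets its own object
A1 <- data.table(a = c(rep(1, 5), rep(2, 5)), b1 = 2:11, b2 = 22:31,
                 c = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2))
A2 <- copy(A1)

# Original approach: look the column up by name within each (a, c) group
setkey(A1, a, c)
A1[, res := get(paste0("b", a)), by = c("a", "c")]

# Keyed sub-assignment: one keyed subset per value of a, no grouping
setkey(A2, a)
A2[.(1), res := b1]
A2[.(2), res := b2]

# Put both tables in the same row order before comparing
setkey(A1, a, b1)
setkey(A2, a, b1)
same_res <- identical(A1$res, A2$res)
```

On the real data, the same comparison could be wrapped in whatever profiler is already in use to see whether the difference matters.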

For Example 2, sorting/keying by res may also improve performance:

setkey(A, c,res)
z = A[, list(length(.I[res>5]),(res[1])), by=c]
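
A possible further tweak, which is not part of the answer above: counting with sum(res > 5) avoids building the row-index vector .I and subsetting it. The sketch below uses a tiny made-up table (the values and the column names n_gt5 and min_res are illustrative) to show that the variants agree; whether it saves a noticeable amount of memory on the real data would need to be measured.

```r
library(data.table)

# Tiny illustrative table with the two columns Example 2 needs
A <- data.table(c = c(1, 2, 1, 2, 1, 2),
                res = c(2L, 3L, 4L, 27L, 6L, 29L))

# Key by c, then res, so the first res in each group is the minimum
setkey(A, c, res)

# Form used in the answer: count via .I, minimum via the first row
z_answer <- A[, list(n_gt5 = length(.I[res > 5]), min_res = res[1]), by = c]

# Alternative: count TRUE values directly, without touching .I
z_sum <- A[, list(n_gt5 = sum(res > 5), min_res = res[1]), by = c]

# Same alternative with an explicit min(), which does not rely on the key order
z_min <- A[, list(n_gt5 = sum(res > 5), min_res = min(res)), by = c]
```

Note that res[1] is the minimum only because the table is keyed on (c, res); without that key, min(res) is the safe choice.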
