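I am subsampling rows from a data frame with columns c("x", "y", "density"), over various combinations of c("s_size", "reps"). reps = replicates, s_size = the number of rows subsampled from the whole data frame.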
> head(data_xyz)
x y density
1 6 1 0
2 7 1 17600
3 8 1 11200
4 12 1 14400
5 13 1 0
6 14 1 8000
#Subsampling###################
subsample_loop <- function(s_size, reps, int) {
tm1 <- system.time( #start timer
{
subsample_bound = data.frame()
#Perform Subsampling of the general
for (s_size in seq(1,s_size,int)){
for (reps in 1:reps) {
subsample <- sample.df.rows(s_size, data_xyz)
assign(paste("sample" ,"_","n", s_size, "_", "r", reps , sep=""), subsample)
subsample_replicate <- subsample[,] #temporary variable
subsample_replicate <- cbind(subsample, rep(s_size,(length(subsample_replicate[,1]))),
rep(reps,(length(subsample_replicate[,1]))))
subsample_bound <- rbind(subsample_bound, subsample_replicate)
}
}
}) #end timer
colnames(subsample_bound) <- c("x","y","density","s_size","reps")
subsample_bound
} #end function
Here's the function call:
source("R/functions.R")
subsample_data <- subsample_loop(s_size=206, reps=5, int=10)
Here is the row-subsampling function:
# Samples a number of rows in a dataframe, outputs a dataframe of the same # of columns
# df Data Frame
# N number of samples to be taken
sample.df.rows <- function (N, df, ...)
{
df[sample(nrow(df), N, replace=FALSE,...), ]
}
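This is too slow, and I have tried several of the apply functions without any luck. I will be doing roughly 1,000-10,000 replicates for each s_size from 1:250.
Let me know what you think! Thanks in advance.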
==========
Update: sample data to subsample from: https://www.dropbox.com/s/47mpo36xh7lck0t/density.csv
Joran's code, wrapped in functions (in the sourced functions.R file):
foo <- function(i,j,data){
res <- data[sample(nrow(data),i,replace = FALSE),]
res$s_size <- i
res$reps <- rep(j,i)
res
}
resampling_custom <- function(dat, s_size, int, reps) {
ss <- rep(seq(1,s_size,by = int),each = reps)
id <- rep(seq_len(reps),times = s_size/int)
out <- do.call(rbind,mapply(foo,i = ss,j = id,MoreArgs = list(data = dat),SIMPLIFY = FALSE))
}
Calling the function:
set.seed(2)
out <- resampling_custom(dat=retinal_xyz, s_size=206, int=5, reps=10)
Unfortunately, the output comes with a warning message:
Warning message:
In mapply(foo, i = ss, j = id, MoreArgs = list(data = dat), SIMPLIFY = FALSE) :
longer argument not a multiple of length of shorter
Answer 0 (score: 3)
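I have put very little thought into actually optimizing this; I just focused on doing something at least reasonable while matching your procedure.
Your biggest problem is that you are growing objects via rbind and cbind. Basically, any time you see someone write data.frame() or c() and then expand that object using rbind, cbind, or c, you can be fairly sure that the resulting code is essentially the slowest possible way of doing the task, because each call copies everything accumulated so far.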
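To illustrate the point about growing objects, here is a minimal sketch comparing the two patterns on made-up data. The names toy, grow_rbind, and bind_once are hypothetical and do not come from the question, and timings will vary by machine, so none are quoted here.

```r
# Hypothetical toy data standing in for data_xyz
set.seed(1)
toy <- data.frame(x = 1:500, y = 1, density = runif(500))

# Pattern to avoid: start empty and rbind inside the loop,
# which copies the accumulated data frame on every iteration
grow_rbind <- function(df, n_iter, n_rows) {
  out <- data.frame()
  for (k in seq_len(n_iter)) {
    out <- rbind(out, df[sample(nrow(df), n_rows), ])
  }
  out
}

# Alternative: collect the pieces in a list and bind once at the end
bind_once <- function(df, n_iter, n_rows) {
  pieces <- lapply(seq_len(n_iter), function(k) df[sample(nrow(df), n_rows), ])
  do.call(rbind, pieces)
}

system.time(a <- grow_rbind(toy, 200, 50))
system.time(b <- bind_once(toy, 200, 50))
dim(a); dim(b)  # both: 200 * 50 = 10000 rows, 3 columns
```

Both functions return the same shape of result; only the way the pieces are accumulated differs.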
This version is roughly 12-13 times faster, and I'm sure you could squeeze more out of it if you really put some thought into it:
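s_size <- 200
int <- 10
reps <- 30
# dat is the data frame to subsample (e.g. data_xyz from the question)
sizes <- seq(1, s_size, by = int)  # the sample sizes to draw
ss <- rep(sizes, each = reps)  # one entry per (size, replicate) pair
# Use length(sizes) rather than s_size/int so ss and id always match in length
id <- rep(seq_len(reps), times = length(sizes))
# Draw i rows without replacement and tag them with sample size and replicate
foo <- function(i,j,data){
res <- data[sample(nrow(data),i,replace = FALSE),]
res$s_size <- i
res$reps <- rep(j,i)
res
}
# Build every subsample as a list element, then bind them once at the end
out <- do.call(rbind,mapply(foo,i = ss,j = id,MoreArgs = list(data = dat),SIMPLIFY = FALSE))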
The best part about R is that not only is this way faster, it's also a lot less code.
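As one possible way to squeeze out a bit more, here is a sketch that samples only row indices and then subsets the data frame a single time, instead of creating one small data frame per replicate. The helper name resample_by_index and the toy data are hypothetical, and I have not benchmarked it against the version above. It derives the replicate ids from length(sizes), so ss and id always have the same length. With s_size = 206 and int = 5, seq(1, 206, by = 5) has 42 values while 206/5 = 41.2 is truncated to 41 by rep, and that mismatch appears to be the source of the mapply warning reported in the question's update.

```r
# Hypothetical helper, not from the original answer: sample row indices
# first, then subset the data frame once
resample_by_index <- function(dat, s_size, int, reps) {
  sizes <- seq(1, s_size, by = int)
  ss <- rep(sizes, each = reps)
  id <- rep(seq_len(reps), times = length(sizes))
  # sample.int(n, k) draws k distinct indices from 1..n (no replacement)
  idx <- lapply(ss, function(k) sample.int(nrow(dat), k))
  out <- dat[unlist(idx), , drop = FALSE]
  # Each subsample of size k contributes k consecutive rows
  out$s_size <- rep(ss, times = ss)
  out$reps <- rep(id, times = ss)
  rownames(out) <- NULL
  out
}

# Toy data standing in for the real data frame (an assumption)
set.seed(2)
toy <- data.frame(x = 1:300, y = 1, density = runif(300))
out <- resample_by_index(toy, s_size = 206, int = 5, reps = 10)
head(out)
```

The data frame passed in must have at least s_size rows, since sampling is without replacement.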