I have a data frame in R with 2.7 million rows and 22 columns; it is 388 MB. This data frame contains data that I want to subset. In fact, I have to subset it about 100,000 times. What is the best way to do this? Right now I am using a data frame and it is too slow: each iteration takes about 1 second. Thanks. Here is some toy code:
library(data.table)

s <- c(100, 100, 100, 800, 800, 6662, 33565, 265653262, 266532)
p <- c(5, 5, 5, 10, 10, 10, 8, 9, 10)
name <- c("bob", "bob", "bob", "ed", "ed", "ed", "joe", "frank", "ted")
time <- as.POSIXct(c("2014-10-27 18:11:36 PDT", "2014-10-27 18:11:37 PDT", "2014-10-27 18:11:38 PDT",
                     "2014-10-27 18:11:39 PDT", "2014-10-27 18:11:40 PDT", "2014-10-27 18:11:41 PDT",
                     "2014-10-27 19:11:36 PDT", "2014-10-27 20:11:36 PDT", "2014-10-27 21:11:36 PDT"))

dat <- data.table(s, p, name, time)
dat  # here is the data frame; in reality it has 2.7 million rows and 22 cols
Here is the subset. In this toy example I only subset once; in reality I have a loop of 100k iterations in which the 100, the 5, the "bob", and the times change on every pass:
result <- subset(dat, as.numeric(s) == 100
& p == 5
& name == "bob"
& time >= "2014-10-27 18:11:36 PDT"
& time <= "2014-10-27 18:12:00 PDT"
)
result
How can I do this subsetting faster? I tried data.table() instead of data.frame, but it is still slow.
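For reference, this is a minimal sketch of the keyed-lookup approach applied to the toy data (setkey() and the .() join syntax are standard data.table features; I have not confirmed whether this actually helps at my scale):

library(data.table)

# Key the table on the columns used for the lookup
setkey(dat, name, p, s)

# Keyed join: equivalent to subset(dat, name == "bob" & p == 5 & s == 100),
# but resolved by binary search on the key instead of scanning every row
result <- dat[.("bob", 5, 100)]

# The time window can still be filtered afterwards
result <- result[time >= as.POSIXct("2014-10-27 18:11:36") &
                 time <= as.POSIXct("2014-10-27 18:12:00")]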
Answer 0 (score: 0)
Here is a simple doParallel example:
library(doParallel)

data(iris)

# One chunk per species
species.split <- split(iris, iris$Species)

# Start a parallel backend with one worker per core
cl <- makeCluster(detectCores())
registerDoParallel(cl)

# Fit one model per chunk in parallel; the results come back as a list
species.models <- foreach(i = species.split) %dopar% {
  lm(Sepal.Length ~ Petal.Width * Petal.Length, data = i)
}
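The result is an ordinary list with one fitted model per species, for example:

# Inspect the coefficients of each per-species model
lapply(species.models, coef)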
For your case, it could be a foreach over the different subsets you define, something like:
# Names defining the subsets to extract
names.to.split <- c('bob', 'james', 'jones')

# Each worker returns its subset; note that assign() inside %dopar% would only
# create objects in the worker's environment, not in the master session,
# so the subsets are returned and collected into a named list instead
subsets <- foreach(i = names.to.split) %dopar% {
  subset(dat, name == i)
}
names(subsets) <- paste0('dat', names.to.split)
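Once the loop has finished, the list elements can be used like any other subset, and the cluster should be shut down (stopCluster() comes from the parallel package that doParallel loads):

head(subsets[['datbob']])

# Release the workers when done
stopCluster(cl)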