I find that looping over each level of a factor is very slow.
The data is a timetable for a number of trains:
col1 col2 col3 col4 col5
train start destiny starttime arrivaltime
[factor] [factor] [factor] [date&time] [date&time]
There are 10M rows and about 1k trains, so each train has ~10k rows.
I tried the following test code:
data = data[order(data$train, data$starttime), ] # sort according to train, and then according to starttime
length1 = numeric( length(levels(data$train)) )
ii = 1
sub = data[1,] # initialize it
for (t in levels(data$train))
{
sub = subset(data, train==t) #subset of each train
length1[ii] = nrow(sub)
ii = ii +1
print(ii)
}
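It runs very slowly: each iteration of the loop takes a few seconds on my laptop. I would like to know what I can do to make it more efficient.
For example, sub is a variable that changes on every iteration. Should I avoid copying these rows into sub? Since sub changes length from one iteration to the next, should I allocate more memory for it when I initialize it?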
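PS: What I really want to do is check, for each train, whether the destination city (destiny) of a trip equals the start city of the next trip. The code is: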
data = data[order(data$train, data$starttime), ] # sort according to train, and then according to starttime
sub = data[1,] # initialization
for (t in levels(data$train))
{
sub = subset(data, train==t) #subset of each train
for (i in seq_len(nrow(sub)-1) ) # seq_len() gives an empty loop for a one-row train, whereas 1:0 would iterate over 1 and 0
{
if ( as.character(sub$destiny[i]) != as.character(sub$start[i+1]) )
# if the destiny != the start city of the next trip
{ do something }
}
}
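Answer 0 (score: -1)
How about using the dplyr package? It was written by Hadley Wickham for this kind of grouped data manipulation and is very intuitive to use. The following links contain tutorials for the package.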
http://rstudio-pubs-static.s3.amazonaws.com/11068_8bc42d6df61341b2bed45e9a9a3bf9f4.html
http://www.r-bloggers.com/hands-on-dplyr-tutorial-for-faster-data-manipulation-in-r/
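As a rough illustration of what that could look like for the question's data, here is a minimal sketch. It assumes the column names from the question (train, start, destiny, starttime); the toy values are made up, and start and destiny are stored as character rather than factor so they can be compared directly (with factors, wrap both in as.character() as the question does). It is one possible approach, not a tuned solution for 10M rows.

```r
library(dplyr)

# Toy timetable using the question's column names (values are invented).
data <- data.frame(
  train     = factor(c("A", "A", "A", "B", "B")),
  start     = c("X", "Y", "Z", "P", "Q"),
  destiny   = c("Y", "W", "X", "Q", "P"),
  starttime = as.POSIXct("2024-01-01 08:00:00", tz = "UTC") + 3600 * c(0, 1, 2, 0, 1),
  stringsAsFactors = FALSE
)

# Number of rows per train, replacing the first loop in the question.
trips_per_train <- data %>% count(train)

# Trips whose destination differs from the start city of the same
# train's next trip, replacing the nested loop in the question.
mismatches <- data %>%
  arrange(train, starttime) %>%          # same ordering as in the question
  group_by(train) %>%
  mutate(next_start = lead(start)) %>%   # start city of the following trip
  ungroup() %>%
  filter(!is.na(next_start),             # the last trip of a train has no successor
         destiny != next_start)

print(trips_per_train)
print(mismatches)
```

In this toy data only train A's second trip (Y to W, followed by a trip starting at Z) is flagged. The point of the design is that the grouping and the shift by one row are done in a single vectorized pass, so no per-train subset() call has to scan the whole table.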