I wrote the following function in R. It counts the number of observations that fall within each time interval.
library(chron)  # provides times()

time_interval <- function(time_vector){
  # 24 window start times, one per hour
  time_seq <- seq(from=as.POSIXct("2012-01-01 00:00:00", tz="UTC"),
                  to=as.POSIXct("2012-01-01 23:00:00", tz="UTC"), by="hour")
  time_seq <- strftime(time_seq, format="%H:%M:%S", tz = "UTC")
  start_time <- times(time_seq)
  end_time <- times(start_time) + times("01:59:59")  # each window spans 2 hours
  time_df <- data.frame(start_time = start_time, end_time = end_time)
  format_time_vector <- times(time_vector)  # converting into times format
  time_count <- c()
  time_interval <- c()
  # count the observations that fall inside each window
  for(i in 1:NROW(time_df)){
    time_count <- append(time_count, sum(format_time_vector >= times(time_df[i,1]) & format_time_vector <= times(time_df[i,2])))
    time_interval <- append(time_interval, paste(as.character(time_df[i,1]), as.character(time_df[i,2])))
  }
  my_new_data <- data.frame(timeinterval = time_interval, timecount = time_count)
  return(my_new_data)
}
I have the following data frame:
structure(list(email_address_hash = structure(1:3, .Label = c("0004eca7b8bed22aaf4b320ad602505fe9fa9d26",
"00198ee5364d73796e0e352f1d2576f8e8fa99db", "35c0ef2c2a804b44564fd4278a01ed25afd887f8"
), class = "factor"), open_times = structure(c(1L, 3L, 2L), .Label = c("04:39:24 10:39:43",
"09:57:20 19:00:09", "21:12:04 07:05:23 06:31:24"), class = "factor"),
desired_training_list = list(list("04:39:24"), list(c("21:12:04",
"07:05:23")), list("09:57:20")), desired_testing_list = c("10:39:43",
"06:31:24", "19:00:09")), .Names = c("email_address_hash",
"open_times", "desired_training_list", "desired_testing_list"
), row.names = c(NA, -3L), class = "data.frame")
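Goal: find the time interval that contains the largest number of observations. We use 2-hour intervals. For example, if we have 10 observations in the interval 00:00:00-02:00:00 and 5 observations in the interval 08:00:00-10:00:00, we choose 00:00:00-02:00:00 as the output and show it in a data frame column.
Note that if every interval has the same number of observations, we can pick any interval at random as the best interval for that customer.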
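To make the selection rule concrete, here is a minimal sketch using the two counts from the example above. The vector `counts` and the variable names are illustrative only and are not part of my real data.

```r
# Hypothetical counts per 2-hour interval (values taken from the example above)
counts <- c("00:00:00-02:00:00" = 10, "08:00:00-10:00:00" = 5)

# which.max() returns the position of the first maximum
best <- names(counts)[which.max(counts)]

# If ties should be broken at random instead of by position:
tied <- names(counts)[counts == max(counts)]
best_random <- tied[sample.int(length(tied), 1)]
```

With these counts both `best` and `best_random` are the 00:00:00-02:00:00 interval, because there is no tie.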
I tried the following:
library(data.table)  # provides setorder()
data$training_best_time <- rep(NA_character_, NROW(data))
data$training_best_time_count <- rep(0, NROW(data))
data$training_best_time<- apply(data[,3,drop= FALSE], MARGIN = 1, function(x) as.character(setorder(time_interval(as.vector(unlist(x))), -timecount)[1,1]))
data$training_best_time_count <- apply(data[,3,drop= FALSE], MARGIN = 1, function(x) as.character(setorder(time_interval(as.vector(unlist(x))), -timecount)[1,2]))
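Problem: when I run this on millions of rows, it takes 4-5 hours or more to finish. I would like to make it faster.
Suspected bottleneck: based on my experience with data.table, I think it is my hand-written function that makes R spend so much time, although I am not sure. I also tried looping inside data.table, but that did not reduce the execution time much.
Please help me speed up the code. Let me know if anything in my question is unclear.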
Answer 0 (score: 1)
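Sticking with your function as written (and leaving its name aside), you can get some improvement by adjusting the for loop slightly. Preallocating the result vector should help, because append() copies the vector on every iteration. You can also drop the time_interval part of the loop entirely, since that part is very easy to vectorize: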
time_count <- vector( mode = "integer", length = nrow( time_df ) )
for(i in 1:nrow(time_df)){
time_count[i] <- sum(format_time_vector >= times(time_df[i,1]) & format_time_vector <= times(time_df[i,2]))
}
time_interval <- paste( time_df$start_time, time_df$end_time )
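Going one step further, the counting loop itself can probably be removed. Each window in the question starts on the hour and runs to 1:59:59 later, so a window's count is the count for its starting hour plus the count for the next hour. The sketch below relies on that observation and assumes the inputs are "HH:MM:SS" strings with whole seconds. The function name count_two_hour_windows and the interval labels are my own, not from the question, and the last label ("23:00:00 24:59:59") may not match what chron prints.

```r
# Hypothetical helper (not from the original post): counts observations in each
# 2-hour window that starts on the hour, without looping over the windows.
count_two_hour_windows <- function(time_vector) {
  # "04:39:24" -> 4; as.character() also handles factor input
  hours <- as.integer(substr(as.character(time_vector), 1, 2))
  # observations per clock hour 0..23 (position 1 is hour 0)
  hourly <- tabulate(hours + 1L, nbins = 24L)
  # window h covers hours h and h + 1; the 23:00 window only sees hour 23
  window_count <- hourly + c(hourly[-1L], 0L)
  data.frame(
    timeinterval = sprintf("%02d:00:00 %02d:59:59", 0:23, 1:24),
    timecount = window_count
  )
}

res <- count_two_hour_windows(c("21:12:04", "07:05:23"))
```

The best window is then `res$timeinterval[which.max(res$timecount)]`, which returns the first window with the largest count when there is a tie.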
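Also, I just realized that you are running the whole function twice, once for each output column. You could run it once, return both outputs as list items, and then bind them into a data frame: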
data2 <- lapply(data[[3]], function(x) as.character(setorder(time_interval(as.vector(unlist(x))), -timecount)[1,]))
data2 <- do.call( rbind, data2 )
colnames( data2 ) <- c( "training_best_time", "training_best_time_count" )  # data2 is a matrix here, so set colnames rather than names
data <- cbind( data, data2 )
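As a self-contained illustration of the "compute once per row, then bind" idea, here is a small base-R sketch. The function best_window is a hypothetical stand-in for time_interval() plus the sort. It assumes "HH:MM:SS" strings and the same hourly 2-hour windows as the question, and the toy list mirrors the desired_training_list column.

```r
# Toy input mirroring the list column in the question
training <- list(list("04:39:24"), list(c("21:12:04", "07:05:23")), list("09:57:20"))

# Hypothetical stand-in: returns the best window and its count in one call
best_window <- function(x) {
  hours <- as.integer(substr(as.character(unlist(x)), 1, 2))
  hourly <- tabulate(hours + 1L, nbins = 24L)   # observations per clock hour
  counts <- hourly + c(hourly[-1L], 0L)         # window h = hour h + hour h + 1
  i <- which.max(counts)                        # first window with the top count
  data.frame(
    training_best_time = sprintf("%02d:00:00 %02d:59:59", i - 1L, i),
    training_best_time_count = counts[i],
    stringsAsFactors = FALSE
  )
}

# One call per row; both output columns come back together
result <- do.call(rbind, lapply(training, best_window))
```

On millions of rows, binding many one-row data frames with rbind can itself be slow. data.table::rbindlist() is a likely faster replacement for the do.call(rbind, ...) step, though I have not timed it on your data.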