I am working with a data frame of more than 200,000 rows and many columns. Let's take a sample dummy version, df:
set.seed(1)
"timeslot" = c(as.integer(abs(runif(10000,min=1,max=1000))))
"ID" = c(LETTERS[abs(as.integer(rnorm(10000,2)**3))%%9+1])
"variable1" = c(as.integer(rnorm(10000,2)**3))
df = data.frame(timeslot,ID,variable1)
df = df[order(df$timeslot, df$ID),]
I also compute a column, called min1, that checks whether the row's ID also appears anywhere in the previous timeslot:
df$min1 <- sapply(seq(nrow(df)), function(x)
{
if(df[x, "timeslot"] == 1){0} else {
max(df[x, "ID"] %in% df[df$timeslot == df[x,"timeslot"] - 1,"ID"])}
})
This all works fine and gives the following head(df)/tail(df):
timeslot ID variable1 min1
4919 1 A 15 0
2329 1 C 48 0
7359 1 C 1 0
1978 1 E 6 0
2883 1 F 7 0
7448 1 F 21 0
-------------------------------
8462 998 F 1 1
1724 998 H 2 0
989 999 A 7 1
2589 999 D 12 1
3473 999 D 0 1
780 999 I 5 0
I want to perform some calculations on variable1, grouped by unique timeslot + ID. One of these calculations is funfac:
funfac <- function(x, y) {
  total <- 0  # local accumulator, so the function no longer depends on a global 'total'
  for (i in x) total <- total + i ** y
  return(abs(total / length(x)) ** (1 / y))
}
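However, now comes the hard part: for each ID in a given timeslot, I want to run the calculation over all rows with the same ID in that timeslot and in the previous timeslot. So if timeslot '2' contains 3x D and timeslot '1' contains 2x D, the calculation should be done over all 5 D's. My column min1 helps identify whether the ID is present in the previous timeslot. If it is not, the calculation should return NA.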
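To make the pooling rule concrete, here is a minimal sketch on made-up data (the `toy` data frame and its values are hypothetical, not taken from df). It pools the five D rows from timeslots 1 and 2 and uses a vectorised expression that should be equivalent to funfac(pooled, 5):

```r
# Toy data (hypothetical values): 2 D's in timeslot 1 and 3 D's in timeslot 2
toy <- data.frame(timeslot = c(1, 1, 2, 2, 2),
                  ID = "D",
                  variable1 = c(1, 2, 3, 4, 5))

# Pool variable1 for ID "D" over timeslot 2 and the previous timeslot 1
pooled <- toy$variable1[toy$ID == "D" & toy$timeslot %in% c(1, 2)]

# Vectorised form of the funfac calculation with y = 5:
# absolute value of the mean fifth power, then the fifth root
lp5_D <- abs(mean(pooled^5))^(1/5)
```

All five values enter the mean, which is the behaviour wanted for every (timeslot, ID) pair whose ID also occurs in the previous timeslot.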
At first, I did this with the following code:
lp5 = c()
for (j in 1:nrow(df)){
if (df[j,"min1"] == 0){lp5 = c(lp5,NA)} else {
total = 0
x = df[which((df[,"timeslot"] == df[j,"timeslot"] | df[,"timeslot"] == (df[j,"timeslot"]-1)) & df[,"ID"]==(df[j,"ID"])),"variable1"]
for (i in x){
i = (i ** 5);
total <- total + i
}
lp5 = c(lp5,((abs(total/(length(x))))**(1/(5))))
}
}
tempdf = data.frame(df[,"timeslot"],df[,"ID"], lp5)
lp5 = tempdf[!duplicated(tempdf[,1:2]),][,3]
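Thinking that I was doing a lot of duplicated calculation, I thought: why not check whether the calculation has already been done? I do this by adding each unique timeslot + ID (including the calculated value) to a data frame, and checking on every iteration whether the value is already in that data frame.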
lp5DF = data.frame("timeslot" = numeric(0), "ID" = character(0), "lp5" = numeric(0))
for (j in 1:nrow(df)){
if (duplicated(rbind(lp5DF[,1:2],data.frame(timeslot=df[j,"timeslot"], ID=df[j,"ID"])))[nrow(lp5DF)+1]) {next} else{
if (df[j,"min1"] == 0){lp5DF = rbind(lp5DF, data.frame("timeslot" = df[j,"timeslot"], "ID" = df[j,"ID"], "lp5" = NA))} else {
total = 0
x = df[which((df[,"timeslot"] == df[j,"timeslot"] | df[,"timeslot"] == (df[j,"timeslot"]-1)) & df[,"ID"]==(df[j,"ID"])),"variable1"]
for (i in x){
(i <- i ** 5);total <- total + i
}
lp5DF = rbind(lp5DF, data.frame("timeslot" = df[j,"timeslot"], "ID" = df[j,"ID"], "lp5" = ((abs(total/(length(x))))**(1/5)))) }
}
}
The output of lp5DF (head/tail) would be:
timeslot ID lp5
1 1 A NA
2 1 B NA
3 1 C NA
4 1 D NA
5 1 E NA
6 1 F NA
-------------------------
7738 999 B 14.83423
7739 999 C 14.80149
7740 999 E NA
7741 999 F 49.48538
7742 999 G 23.05222
7743 999 H NA
And: lp5DF[,3]==lp5
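However, checking this turns out to be much slower (6.5 times in my case). Since I have to run this kind of calculation many times over many rows (and the data frame may grow later in the project), both of my approaches are too slow. Why is the second one so much slower, and is there a way to speed this up? Perhaps with something like lapply or the dplyr package?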
Answer 0 (score: 2)
There is a lot to optimize. Try learning data manipulation packages such as dplyr and data.table.
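On why the second loop is slower: a likely cause is that every iteration calls rbind() and duplicated() on a data frame that keeps growing, so each step copies and rescans everything collected so far. A minimal sketch of a loop that avoids this, assuming the unique (timeslot, ID) pairs can be listed up front and the result vector preallocated (the `toy` data and the names `pairs` and `lp5_toy` are made up for illustration):

```r
# Hypothetical toy data
toy <- data.frame(timeslot = c(1, 1, 2, 2, 2),
                  ID = c("A", "A", "A", "B", "B"),
                  variable1 = c(1, 2, 3, 4, 5))

pairs <- unique(toy[c("timeslot", "ID")])   # each (timeslot, ID) pair exactly once
lp5_toy <- rep(NA_real_, nrow(pairs))       # preallocated; NA is the default result

for (k in seq_len(nrow(pairs))) {
  same_id <- toy$ID == pairs$ID[k]
  # Leave NA when the ID does not occur in the previous timeslot
  if (!any(same_id & toy$timeslot == pairs$timeslot[k] - 1)) next
  # Pool the current and the previous timeslot for this ID
  x <- toy$variable1[same_id & toy$timeslot %in% (pairs$timeslot[k] - 0:1)]
  lp5_toy[k] <- abs(mean(x^5))^(1/5)
}
```

This is still a row-by-row loop, so the grouped approaches below should scale much better; it only illustrates that the duplicate check does not have to happen inside the loop.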
min1
library(dplyr)
dfs <- split(df$ID, df$timeslot)
df$min1 <- unlist(mapply(`%in%`, dfs, lag(dfs)))
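One caveat, as far as I can tell: lag(dfs) shifts the list by one element, so each timeslot is compared with the previous timeslot that is present in the data, which is only the same as timeslot - 1 when no timeslot is missing. It also relies on df being sorted by timeslot. A possible base R alternative that looks up timeslot - 1 explicitly is sketched below (the `toy` data and the key names are hypothetical):

```r
# Hypothetical toy data with a gap: timeslot 3 is missing
toy <- data.frame(timeslot = c(1, 1, 2, 2, 4, 4),
                  ID = c("A", "B", "A", "C", "A", "C"),
                  stringsAsFactors = FALSE)

key_here <- paste(toy$timeslot, toy$ID)      # "timeslot ID" key of every row
key_prev <- paste(toy$timeslot - 1, toy$ID)  # key the same ID would have one timeslot earlier

# 1 if the same ID exists in timeslot - 1, else 0 (matches the 0/1 coding in the question)
toy$min1 <- as.integer(key_prev %in% key_here)
```

Here only the A in timeslot 2 gets min1 = 1; the rows in timeslot 4 get 0 because timeslot 3 does not exist.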
lp5
A bit tricky, but manageable:
df1 <- df %>%
group_by(timeslot, ID) %>%
summarise(min1 = all(min1), s = sum(variable1^5), n = n()) %>%
group_by(ID) %>%
  mutate(s1 = s + lag(s), n1 = n + lag(n), lp5 = ifelse(min1, abs(s1/n1)^(1/5), NA))  # abs() before the root, as in the question; a negative mean would otherwise give NaN
lp5 <- df1$lp5
The data.table equivalent:
library(data.table)
setDT(df)
dt1 <- df[, .(min1 = all(min1), s = sum(variable1^5), n = .N), by=.(timeslot, ID)]
dt1[, `:=`(s1 = s + shift(s), n1 = n + shift(n)), by=ID]
dt1[min1 == TRUE, lp5 := abs(s1/n1)^(1/5)]  # abs() before the root, as in the question; no grouping needed for this step
lp5 <- dt1$lp5
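Both versions rest on the same idea: the pooled mean over two timeslots can be rebuilt from per-group sums and counts, (s_t + s_{t-1}) / (n_t + n_{t-1}), so each row of variable1 is touched only once. A minimal base R sketch of that idea, without extra packages, on made-up data (the `toy` data and the column names `p5` and `one` are hypothetical; the previous timeslot is found by key lookup rather than by lag):

```r
# Hypothetical toy data: D occurs in timeslots 1 and 2, E only in timeslot 2
toy <- data.frame(timeslot = c(1, 1, 2, 2, 2, 2),
                  ID = c("D", "D", "D", "D", "D", "E"),
                  variable1 = c(1, 2, 3, 4, 5, 6),
                  stringsAsFactors = FALSE)
toy$p5 <- toy$variable1^5   # fifth powers, summed per group below
toy$one <- 1                # summing this column gives the group size

# Sum of fifth powers (p5) and row count (one) per (timeslot, ID)
agg <- aggregate(toy[c("p5", "one")], by = toy[c("timeslot", "ID")], FUN = sum)
agg <- agg[order(agg$ID, agg$timeslot), ]

# Row index of the same ID in timeslot - 1, or NA if there is none
prev <- match(paste(agg$timeslot - 1, agg$ID), paste(agg$timeslot, agg$ID))

# Pooled value; indexing with NA propagates NA, so no min1 column is needed here
agg$lp5 <- abs((agg$p5 + agg$p5[prev]) / (agg$one + agg$one[prev]))^(1/5)
```

Only the (2, D) row gets a value in this toy example; the other two rows stay NA because their ID has no row in the preceding timeslot.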