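I want to compute first differences on a large panel dataset. At the moment this takes more than an hour. I would really like to know whether there are other options to speed up the process. As an example dataset: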
library(data.table)
set.seed(1)
DF <- data.table(panelID = sample(50,50), # Creates a panel ID
Country = c(rep("A",30),rep("B",50), rep("C",20)),
Group = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20)),
Time = rep(seq(as.Date("2010-01-03"), length=20, by="1 month") - 1,5),
norm = round(runif(100)/10,2),
Income = sample(100,100),
Happiness = sample(10,10),
Sex = round(rnorm(10,0.75,0.3),2),
Age = round(rnorm(10,0.75,0.3),2),
Educ = round(rnorm(10,0.75,0.3),2))
DF[, uniqueID := .I] # Add a unique row ID
So I tried the following:
DFx <- DF
start_time <- Sys.time()
DF <- DF[, lapply(.SD, function(x) x - shift(x)), by = panelID, .SDcols = (sapply(DF, is.numeric))]
end_time <- Sys.time()
DF <- DFx
start_time2 <- Sys.time()
cols = sapply(DF, is.numeric)
DF <- DF[, lapply(.SD, function(x) x - shift(x)), by = panelID, .SDcols = cols]
end_time2 <- Sys.time()
DF <- DFx
start_time3 <- Sys.time()
DF <- DF[order(panelID)] # Sort by panel ID
nm1 <- sapply(DF, is.numeric) # Get the numerical columns
nm1 = names(nm1)
nm2 <- paste("delta", nm1, sep="_")[-6] # Build the names of the delta columns
DF <- DF[,(nm2) := .SD - shift(.SD), by=panelID] # Create the differenced columns
end_time3 <- Sys.time()
end_time3 - start_time3
end_time2 - start_time2
end_time - start_time
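For some reason the third option works on my actual database, but not on this example. It gives the error: Error in FUN(left, right) : non-numeric argument to binary operator.
On my actual database this way of computing is also rather slow (and I still have to subset afterwards).
Any ideas on how to make this faster?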
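The error in the third attempt is consistent with `.SD` containing non-numeric columns: without `.SDcols`, `.SD` holds every column except the grouping column, so `Country` (character) ends up in the subtraction. A minimal sketch of one way around this, assuming a small made-up table (the names `toy`, `num_cols`, and `delta_cols` are illustrative and not part of the original data):

```r
library(data.table)

# Small hypothetical panel with one character column mixed in
toy <- data.table(
  panelID = rep(1:2, each = 3),
  Country = rep(c("A", "B"), each = 3),
  Income  = c(10, 12, 15, 7, 7, 9),
  Age     = c(30, 31, 32, 40, 41, 42)
)

# Names of the numeric columns only, excluding the grouping column
num_cols   <- setdiff(names(toy)[sapply(toy, is.numeric)], "panelID")
delta_cols <- paste("delta", num_cols, sep = "_")

# First difference within each panel; .SDcols keeps Country out of .SD
toy[, (delta_cols) := lapply(.SD, function(v) v - shift(v)),
    by = panelID, .SDcols = num_cols]

toy$delta_Income  # first row of each panel is NA
```

Restricting `.SDcols` to the numeric column names means the character column never reaches the arithmetic, which may also explain why the same code runs on a dataset that happens to contain only numeric columns.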
Answer 0 (score: 3)
data.table is optimized for many rows rather than many columns. Since you have many columns, you could try melting the data.table into long format:
DFm <- melt(DF[, cols, with = FALSE][, !"uniqueID"], id = "panelID")
# This coerces all numbers to double (the common type);
# you could split the data.table into integer and double columns to avoid this
DFm[, value := c(NA, diff(value)), by = .(panelID, variable)]
dcast(DFm, panelID + rowidv(DFm, cols = c("panelID", "variable")) ~ variable, value.var = "value")
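To check that the long route gives the same numbers as differencing column by column, here is a small self-contained sketch. It assumes a made-up table `toy` with two measure columns `x` and `y`, and a helper column `obs` for the within-panel row index; none of these names come from the original data.

```r
library(data.table)

# Toy panel: 3 panels x 4 periods, two numeric measure columns (hypothetical data)
toy <- data.table(
  panelID = rep(1:3, each = 4),
  x = c(1, 3, 6, 10, 2, 2, 5, 9, 0, 4, 4, 7),
  y = c(5, 4, 2, 1, 8, 6, 6, 3, 1, 2, 4, 8)
)

# Wide approach: difference every measure column within each panel
wide <- toy[, lapply(.SD, function(v) v - shift(v)),
            by = panelID, .SDcols = c("x", "y")]

# Long approach: melt, difference once per (panelID, variable), cast back
long <- melt(toy, id.vars = "panelID", measure.vars = c("x", "y"))
long[, value := c(NA, diff(value)), by = .(panelID, variable)]
long[, obs := rowid(panelID, variable)]  # row index within each panel and variable
back <- dcast(long, panelID + obs ~ variable, value.var = "value")

# Both routes should agree
all.equal(wide$x, back$x)
all.equal(wide$y, back$y)
```

Whether the long format is actually faster depends on the shape of the data, so it is worth timing both variants on a sample of the real dataset before committing to one.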