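I'm working on a problem where we are trying to create time series differences for a large number of pairs (40,000+), where each time series is daily with 150+ points.
pairs: each row represents two individuals we want to compare.
ts: each row represents the time series data for a specific individual.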
pairs = data.frame("number" = c(1,2,3,4),
"name1" = c("A","B","C","D"),
"name2" = c("B","D","D","A")
)
pairs$name1 <- as.character(pairs$name1)
pairs$name2 <- as.character(pairs$name2)
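I have R code whose goal is to create, for each pair of individuals (4 in my example), a new data frame holding the differences between their daily time series from the ts data frame. The approach works, but it is extremely slow when I run it on my real dataset, where pairs has 40,000 rows and ts has about 150 columns. Does anyone know how to speed this up? I tried using lapply, but couldn't figure out how to compute the differences and store them in a new time series. Thanks!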
ts = data.frame("name" = c("A","B","C","D"),
"day1" = c(10,12,54,13),
"day2" = c(2,8,47,29),
"day3" = c(1,5,14,36)
)
ts$name <- as.character(ts$name)
Answer 0 (score: 1)
A few comments up front:
i) data.frame() has an argument stringsAsFactors, which you can set to FALSE, i.e.:
pairs = data.frame(
"number" = c(1,2,3,4),
"name1" = c("A","B","C","D"),
"name2" = c("B","D","D","A"),
stringsAsFactors = FALSE
)
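ii) Speeding up code is not a matter of replacing a for-loop with apply, but of data structures and processing efficiency. Packages/functions that loop internally in C++ rather than in R, or C++ code you write yourself, will give you the biggest boost.
iii) I'll also provide a larger dummy example here, so that you and others can test and compare timings more easily: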
# all combination of LETTERS, including identity pairs like A~A
pairs = cbind.data.frame(
"number" = seq(1, 676),
setNames(expand.grid(LETTERS, LETTERS), nm = c("name1", "name2"))
)
# expand.grid produces factor columns
pairs$name1 <- as.character(pairs$name1)
pairs$name2 <- as.character(pairs$name2)
ts = cbind.data.frame(
"name" = LETTERS,
matrix(sample.int(100, 150*26, replace = TRUE), ncol = 150),
stringsAsFactors = FALSE
)
names(ts)[-1] <- paste0("day", names(ts)[-1])
iv) An improved version of the loop could look like this:
# initialize full matrix (since the ID is a number too), allocating necessary memory
diffs2 <- matrix(0, ncol = ncol(ts), nrow = nrow(pairs))
colnames(diffs2) <- colnames(ts)
# first column is given
diffs2[, 1] <- pairs$number
for (row in 1:nrow(pairs)) {
row1 <- as.vector(as.matrix(ts[ts$name==pairs[row,"name1"], -1]))
row2 <- as.vector(as.matrix(ts[ts$name==pairs[row,"name2"], -1]))
diffs2[row, -1] <- row1 - row2
}
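This is already several times faster than yours, but it illustrates how awkward a data.frame object is for time series. The data should instead be an object of a class that allows more direct/efficient work with numeric data (several packages provide time series classes).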
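To illustrate the point about data structures, here is a minimal base R sketch that keeps the series in a plain numeric matrix with the individual's name as row name, so the whole computation becomes one vectorized subtraction. It uses the small example data from the question; the names ts_mat and diffs_mat are made up for this illustration, and it assumes every name in pairs occurs exactly once in ts.

```r
# Small example data from the question
pairs <- data.frame(number = 1:4,
                    name1 = c("A", "B", "C", "D"),
                    name2 = c("B", "D", "D", "A"),
                    stringsAsFactors = FALSE)
ts <- data.frame(name = c("A", "B", "C", "D"),
                 day1 = c(10, 12, 54, 13),
                 day2 = c(2, 8, 47, 29),
                 day3 = c(1, 5, 14, 36),
                 stringsAsFactors = FALSE)

# Numeric matrix with one row per individual, addressable by name
# (assumption: names in ts are unique)
ts_mat <- as.matrix(ts[, -1])
rownames(ts_mat) <- ts$name

# Character indexing selects the row of every pair member at once,
# so no explicit loop over the pairs is needed
diffs_mat <- ts_mat[pairs$name1, , drop = FALSE] -
  ts_mat[pairs$name2, , drop = FALSE]
rownames(diffs_mat) <- paste0(pairs$name1, "~", pairs$name2)

print(diffs_mat)
```

Each row of diffs_mat is name1 minus name2 for one pair, in the same order as the rows of pairs.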
Now, with dplyr and tidyr, the answer remains fairly simple yet is quite fast:
# simple way of measuring time
start <- Sys.time()
xx <- tidyr::gather(ts, key = "day", value = "value", 2:151)
yy <- dplyr::left_join(pairs, xx, by = c("name1" = "name"))
zz <- dplyr::left_join(yy, xx, by = c("name2" = "name", "day" = "day"))
res <- dplyr::mutate(zz, diff = value.x - value.y)
end <- Sys.time()
duration <- end - start
duration
Time difference of 0.06700397 secs
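You can also try the approaches from the other two answers. It will be evident that the mapply solution is slow, and that the data.table one, which does not fully work yet, looks slower and more complicated.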
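For anyone who wants to compare timings themselves, the sketch below times two base R variants on a dummy data set of the same shape as in iii) and checks that they return the same numbers. The seed and the names loop_time, matrix_time, diffs_loop and diffs_mat are arbitrary choices for this illustration; the absolute timings depend on the machine, so none are quoted here.

```r
set.seed(1)  # arbitrary seed, only to make the dummy data reproducible

# Dummy data: all 676 combinations of LETTERS, 150 days per individual
pairs <- cbind.data.frame(
  number = seq_len(676),
  setNames(expand.grid(LETTERS, LETTERS, stringsAsFactors = FALSE),
           c("name1", "name2"))
)
ts <- cbind.data.frame(
  name = LETTERS,
  matrix(sample.int(100, 150 * 26, replace = TRUE), ncol = 150),
  stringsAsFactors = FALSE
)
names(ts)[-1] <- paste0("day", names(ts)[-1])

# Variant 1: loop over the pairs with a preallocated result matrix
loop_time <- system.time({
  diffs_loop <- matrix(0, nrow = nrow(pairs), ncol = ncol(ts) - 1)
  for (i in seq_len(nrow(pairs))) {
    row1 <- unlist(ts[ts$name == pairs$name1[i], -1])
    row2 <- unlist(ts[ts$name == pairs$name2[i], -1])
    diffs_loop[i, ] <- row1 - row2
  }
})

# Variant 2: one vectorized subtraction on a name-indexed matrix
matrix_time <- system.time({
  ts_mat <- as.matrix(ts[, -1])
  rownames(ts_mat) <- ts$name
  diffs_mat <- ts_mat[pairs$name1, ] - ts_mat[pairs$name2, ]
})

# Both variants should agree before their timings are compared
stopifnot(all(diffs_loop == unname(diffs_mat)))
print(c(loop = loop_time[["elapsed"]], matrix = matrix_time[["elapsed"]]))
```

system.time() is used here instead of subtracting two Sys.time() values; either works for a rough comparison.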
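Answer 1 (score: 0)
I have a data.table solution that may help. The idea is to switch to long format so that grouped operations can be used (the equivalent of apply), and to create permutation columns to build the pairs: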
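library(data.table)
# row index in ts for each member of every pair; match() compares names exactly,
# whereas grep() would also hit names that merely contain the pattern
name1idx <- match(pairs$name1, ts$name)
name2idx <- match(pairs$name2, ts$name)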
plouf <-melt(setDT(ts),measure.vars = patterns("^day"),variable.name = "day")
plouf[,name1 := name[name1idx],by = day]
plouf[,value1 := value[name1idx],by = day]
plouf[,name2 := name[name2idx],by = day]
plouf[,value2 := value[name2idx],by = day]
plouf[,diff := value1 - value2]
plouf[,.(day,diff),by = .(name1,name2)]
    name1 name2  day diff
 1:     A     B day1   -2
 2:     A     B day2   -6
 3:     A     B day3   -4
 4:     B     D day1   -1
 5:     B     D day2  -21
 6:     B     D day3  -31
 7:     C     D day1   41
 8:     C     D day2   18
 9:     C     D day3  -22
10:     D     A day1    3
11:     D     A day2   27
12:     D     A day3   35
name1idx and name2idx are the indices into ts$name that correspond to pairs$name1 and pairs$name2. All pairings can be done this way.
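As a possible variant of the long-format idea, the sketch below uses two merges instead of index vectors, so it works for any number of pairs. It is only an illustration on the example data from the question, assuming the data.table package is installed; the names pairs_dt, long and res are made up here.

```r
library(data.table)

# Small example data from the question
pairs <- data.frame(number = 1:4,
                    name1 = c("A", "B", "C", "D"),
                    name2 = c("B", "D", "D", "A"),
                    stringsAsFactors = FALSE)
ts <- data.frame(name = c("A", "B", "C", "D"),
                 day1 = c(10, 12, 54, 13),
                 day2 = c(2, 8, 47, 29),
                 day3 = c(1, 5, 14, 36),
                 stringsAsFactors = FALSE)

# as.data.table() copies, so the original data frames stay untouched
pairs_dt <- as.data.table(pairs)
long <- melt(as.data.table(ts), id.vars = "name",
             variable.name = "day", value.name = "value")

# Attach the series of name1: one row per pair and day
res <- merge(pairs_dt, long, by.x = "name1", by.y = "name",
             allow.cartesian = TRUE)
# Attach the series of name2 for the same day; the two value columns
# are disambiguated as value1 and value2
res <- merge(res, long, by.x = c("name2", "day"), by.y = c("name", "day"),
             suffixes = c("1", "2"))

res[, diff := value1 - value2]
setorder(res, number, day)
print(res[, .(name1, name2, day, diff)])
```

allow.cartesian = TRUE is set because each pair row deliberately matches many rows of the long table, which data.table may otherwise reject on larger inputs.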
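Answer 2 (score: 0)
I was looking for a solution in which column names are handled dynamically, with no column name used other than name. mapply, dplyr and reshape2 are used in this solution.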
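# library(reshape2)
library(dplyr)  # provides %>%, filter(), select() and mutate()
# A function which will filter value based on pairs
matchPair <- function(x, y){
  # keep the two rows of the pair; they come back in the row order of ts
  matchedRow <- ts %>%
    filter(name == x | name == y) %>%
    select(-name)
  # diff() on the 2-row matrix subtracts the earlier ts row from the later one,
  # so the sign follows the row order of ts rather than name1 - name2
  data.frame(diff(as.matrix(matchedRow))) %>%
    mutate(name = paste0(x, '~', y))
}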
df.r <-do.call(rbind,mapply(matchPair, pairs$name1, pairs$name2,
SIMPLIFY = FALSE))
# Row names are not meaningful. Hence remove those.
row.names(df.r) <- NULL
#Result
#> df.r
# day1 day2 day3 name
#1 2 6 4 A~B
#2 1 21 31 B~D
#3 -41 -18 22 C~D
#4 3 27 35 D~A
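Note that in the result above the sign depends on the row order of ts (A~B is B minus A, while D~A is D minus A). If name1 minus name2 is wanted for every pair, a base R sketch along the same lines could look like this. It still refers to no column name other than name; matchPairOrdered, day_cols and df.ordered are hypothetical names introduced only for this example.

```r
# Small example data from the question
pairs <- data.frame(number = 1:4,
                    name1 = c("A", "B", "C", "D"),
                    name2 = c("B", "D", "D", "A"),
                    stringsAsFactors = FALSE)
ts <- data.frame(name = c("A", "B", "C", "D"),
                 day1 = c(10, 12, 54, 13),
                 day2 = c(2, 8, 47, 29),
                 day3 = c(1, 5, 14, 36),
                 stringsAsFactors = FALSE)

# Every column except "name" is treated as a time point
day_cols <- setdiff(names(ts), "name")

# Hypothetical helper: always returns the series of x minus the series of y
matchPairOrdered <- function(x, y) {
  first  <- unlist(ts[match(x, ts$name), day_cols])
  second <- unlist(ts[match(y, ts$name), day_cols])
  data.frame(as.list(first - second),
             name = paste0(x, "~", y),
             stringsAsFactors = FALSE)
}

df.ordered <- do.call(rbind, Map(matchPairOrdered, pairs$name1, pairs$name2))
row.names(df.ordered) <- NULL
print(df.ordered)
```

Like the mapply version, this calls an R function once per pair, so it is meant to show the consistent sign rather than to be fast.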