Replacing loops in R with functions from the apply family (large dataset)

Time: 2018-01-09 20:45:14

Tags: r for-loop lapply

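I am working on a problem where we are trying to create time series differences for a large number of pairs (40,000+), where each time series has 150+ daily points.

Each row of pairs represents two individuals we want to compare:

pairs = data.frame("number" = c(1,2,3,4),
                   "name1" = c("A","B","C","D"),
                   "name2" = c("B","D","D","A")
                               )

pairs$name1 <- as.character(pairs$name1)
pairs$name2 <- as.character(pairs$name2)

Each row of ts represents the time series data for a particular individual:

ts = data.frame("name" = c("A","B","C","D"),
                        "day1" = c(10,12,54,13),
                        "day2" = c(2,8,47,29),
                        "day3" = c(1,5,14,36)
                             )
ts$name <- as.character(ts$name)

I have R code (a for loop) whose goal is to create, for each pair of individuals (4 in my example), a new data frame holding the differences of their daily time series from the ts data frame. This approach works, but it is extremely slow when I run it on my real dataset, where pairs has 40,000 rows and ts has about 150 columns. Does anyone know how to speed this up? I tried using lapply, but could not figure out how to create the differences and store them in a new time series. Thanks!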
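For reference, the row-by-row approach described in the question could look roughly like the sketch below. This is only an assumed baseline, not the asker's actual code; the names diffs, s1 and s2 are hypothetical.

```r
# Toy data from the question
pairs <- data.frame(number = c(1, 2, 3, 4),
                    name1 = c("A", "B", "C", "D"),
                    name2 = c("B", "D", "D", "A"),
                    stringsAsFactors = FALSE)
ts <- data.frame(name = c("A", "B", "C", "D"),
                 day1 = c(10, 12, 54, 13),
                 day2 = c(2, 8, 47, 29),
                 day3 = c(1, 5, 14, 36),
                 stringsAsFactors = FALSE)

# Hypothetical baseline: grow the result one row at a time.
# Repeated rbind() and data.frame subsetting make this slow for many pairs.
diffs <- data.frame()
for (i in seq_len(nrow(pairs))) {
  s1 <- ts[ts$name == pairs$name1[i], -1]  # series of the first individual
  s2 <- ts[ts$name == pairs$name2[i], -1]  # series of the second individual
  diffs <- rbind(diffs, cbind(number = pairs$number[i], s1 - s2))
}
```

Each iteration filters ts twice and copies the growing result, which is why the running time grows quickly with the number of pairs.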

3 answers:

Answer 0 (score: 1)

A few comments up front:

i) data.frame() has an argument stringsAsFactors, which you can set to FALSE, i.e.:

pairs = data.frame(
    "number" = c(1,2,3,4),
    "name1" = c("A","B","C","D"),
    "name2" = c("B","D","D","A"),
    stringsAsFactors = FALSE
)

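ii) Speeding up your code is not a matter of replacing the for-loop with apply, but a matter of data structures and processing efficiency. Relying on packages/functions that loop internally in C++ rather than in R, or writing the C++ code yourself, will give you the biggest boost.

iii) I will also provide a larger dummy example here, so that you and others can test and compare timings more easily: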

# all combination of LETTERS, including identity pairs like A~A
pairs = cbind.data.frame(
    "number" = seq(1, 676),
    setNames(expand.grid(LETTERS, LETTERS), nm = c("name1", "name2"))
)
# expand.grid produces factor columns
pairs$name1 <- as.character(pairs$name1)
pairs$name2 <- as.character(pairs$name2)

ts = cbind.data.frame(
    "name" = LETTERS,
    matrix(sample.int(100, 150*26, replace = TRUE), ncol = 150),
    stringsAsFactors = FALSE
)
names(ts)[-1] <- paste0("day", names(ts)[-1])

iv) An improved version of your loop could look like this:

# initialize full matrix (since the ID is a number too), allocating necessary memory
diffs2 <- matrix(0, ncol = ncol(ts), nrow = nrow(pairs))
colnames(diffs2) <- colnames(ts)
# first column is given
diffs2[, 1] <- pairs$number

for (row in 1:nrow(pairs)) {
  row1 <- as.vector(as.matrix(ts[ts$name==pairs[row,"name1"], -1]))
  row2 <- as.vector(as.matrix(ts[ts$name==pairs[row,"name2"], -1]))
  diffs2[row, -1] <- row1 - row2
}

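This is already several times faster than your version, but it illustrates how awkward a data.frame is for time series. The data should instead be an object of a class that allows more direct and efficient work with numeric data (several packages provide time series classes).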
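As a minimal sketch of that idea in base R, assuming the series are stored in a plain numeric matrix with one row per individual, the loop can be dropped entirely: match() looks up the row index of every name once, and the subtraction is then a single vectorised matrix operation. The names m, i1 and i2 are illustrative, and the toy data from the question is used.

```r
# Toy data from the question
pairs <- data.frame(number = c(1, 2, 3, 4),
                    name1 = c("A", "B", "C", "D"),
                    name2 = c("B", "D", "D", "A"),
                    stringsAsFactors = FALSE)
ts <- data.frame(name = c("A", "B", "C", "D"),
                 day1 = c(10, 12, 54, 13),
                 day2 = c(2, 8, 47, 29),
                 day3 = c(1, 5, 14, 36),
                 stringsAsFactors = FALSE)

# Numeric matrix: one row per individual, one column per day
m <- as.matrix(ts[, -1])
rownames(m) <- ts$name

# Row index in m of the first and second individual of every pair
i1 <- match(pairs$name1, ts$name)
i2 <- match(pairs$name2, ts$name)

# One vectorised subtraction: name1 series minus name2 series
diffs <- m[i1, , drop = FALSE] - m[i2, , drop = FALSE]
rownames(diffs) <- paste(pairs$name1, pairs$name2, sep = "~")
```

The result has one row per pair and one column per day, so it scales with the size of the output rather than with repeated data.frame lookups.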

Now, using dplyr and tidyr, the answer is still rather simple but quite fast:

# simple way of measuring time
start <- Sys.time()

xx <- tidyr::gather(ts, key = "day", value = "value", 2:151)
yy <- dplyr::left_join(pairs, xx, by = c("name1" = "name"))
zz <- dplyr::left_join(yy, xx, by = c("name2" = "name", "day" = "day"))
res <- dplyr::mutate(zz, diff = value.x - value.y)

end <- Sys.time()

duration <- end - start

duration

Time difference of 0.06700397 secs
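The pipeline above hard-codes the day columns as 2:151. A hedged variant, assuming a tidyr version that provides pivot_longer() (1.0.0 or later), selects every column except name instead, so it works for any number of day columns. It is shown here on the toy data from the question.

```r
library(dplyr)
library(tidyr)

# Toy data from the question
pairs <- data.frame(number = c(1, 2, 3, 4),
                    name1 = c("A", "B", "C", "D"),
                    name2 = c("B", "D", "D", "A"),
                    stringsAsFactors = FALSE)
ts <- data.frame(name = c("A", "B", "C", "D"),
                 day1 = c(10, 12, 54, 13),
                 day2 = c(2, 8, 47, 29),
                 day3 = c(1, 5, 14, 36),
                 stringsAsFactors = FALSE)

# Reshape every column except name, whatever the number of day columns
long <- pivot_longer(ts, cols = -name, names_to = "day", values_to = "value")

# Same two joins as in the pipeline above. With dplyr >= 1.1.0 the first join
# may warn about a many-to-many relationship when a name occurs in several pairs.
res <- pairs %>%
  left_join(long, by = c("name1" = "name")) %>%
  left_join(long, by = c("name2" = "name", "day" = "day")) %>%
  mutate(diff = value.x - value.y)
```

The result is in long format, with one row per pair and day.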

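You can also try the approaches from the other two answers. It is fairly obvious that the mapply solution will be slow, and the data.table one, which is not yet fully working, looks slower and more complicated.

Answer 1 (score: 0)

I have a data.table solution that may help. The idea is to switch to long format so that grouped operations can be used (the equivalent of apply), and to create permutation columns to build the pairs: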

library(data.table)

# match() returns exact name matches; grep() would also match partial names
name1idx <- match(pairs$name1, ts$name)
name2idx <- match(pairs$name2, ts$name)

# long format: one row per (name, day)
plouf <- melt(setDT(ts), measure.vars = patterns("^day"), variable.name = "day")
# note: the := assignments below require nrow(pairs) == nrow(ts)
plouf[, name1 := name[name1idx], by = day]
plouf[, value1 := value[name1idx], by = day]
plouf[, name2 := name[name2idx], by = day]
plouf[, value2 := value[name2idx], by = day]
plouf[, diff := value1 - value2]
plouf[, .(day, diff), by = .(name1, name2)]

    name1 name2  day diff
 1:     A     B day1   -2
 2:     A     B day2   -6
 3:     A     B day3   -4
 4:     B     D day1   -1
 5:     B     D day2  -21
 6:     B     D day3  -31
 7:     C     D day1   41
 8:     C     D day2   18
 9:     C     D day3  -22
10:     D     A day1    3
11:     D     A day2   27
12:     D     A day3   35

name1idx and name2idx are the indices in ts$name that correspond to pairs$name1 and pairs$name2. This can be done for all pairs.
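When the number of pairs differs from the number of individuals, the grouped assignments above no longer line up. One possible way around this, sketched here under the assumption that two merges on the long table are acceptable, is to attach the series of each individual with merge(). The names pairs_dt, ts_dt, long and res are illustrative.

```r
library(data.table)

# Toy data from the question
pairs_dt <- data.table(number = c(1, 2, 3, 4),
                       name1 = c("A", "B", "C", "D"),
                       name2 = c("B", "D", "D", "A"))
ts_dt <- data.table(name = c("A", "B", "C", "D"),
                    day1 = c(10, 12, 54, 13),
                    day2 = c(2, 8, 47, 29),
                    day3 = c(1, 5, 14, 36))

# Long format: one row per (name, day)
long <- melt(ts_dt, id.vars = "name", variable.name = "day", value.name = "value")

# Attach the series of the first individual (one row per pair and day)
res <- merge(pairs_dt, long, by.x = "name1", by.y = "name", allow.cartesian = TRUE)
setnames(res, "value", "value1")

# Attach the series of the second individual for the same day
res <- merge(res, long, by.x = c("name2", "day"), by.y = c("name", "day"))
setnames(res, "value", "value2")

res[, diff := value1 - value2]
setorder(res, number, day)
```

Because the pairs are joined to the series rather than indexed within groups, the same code applies to any number of pairs.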

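Answer 2 (score: 0)

I was looking for a solution in which column names are handled dynamically and no column name other than name is used. mapply, dplyr and reshape2 have been used in this solution.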

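library(dplyr)
# library(reshape2)

# A function which filters the rows of ts that belong to one pair.
# Note: filter() returns rows in ts order, so the sign of the difference
# follows the row order in ts, not the order (x, y) of the pair.
matchPair <- function(x, y){
  matchedRow <- ts %>%
    filter(name == x | name == y) %>%
    select(-name)

  data.frame(diff(as.matrix(matchedRow))) %>%
    mutate(name = paste0(x, '~', y))
}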

df.r <-do.call(rbind,mapply(matchPair, pairs$name1, pairs$name2,
               SIMPLIFY = FALSE))

# Row names are not meaningful. Hence remove those.
row.names(df.r) <- NULL

#Result
#> df.r
#  day1 day2 day3 name
#1    2    6    4  A~B
#2    1   21   31  B~D
#3  -41  -18   22  C~D
#4    3   27   35  D~A