Question

我正在尝试修改另一个列表中包含的一个基于列表的信息。问题的实质是一个列表list1包含数据框 - 每列两列，第1列（time）=时刻，第2列（score）填充0- - 另一个列表list2包含数据框 - 两列包含配对时间实例（例如0.12和0.125或1.54和1.57），或时间窗口的起点和终点。 list1中的每个数据帧在list2

中具有相应的（即同名）数据帧

目标是确定time中每个数据框的哪些时间瞬间（即来自list1列）确实落在list2中关联数据框中包含的任何时间窗口之间{1}}通过修改＆＃39;得分＆＃39;如果其关联的时刻在来自list1的相应（即同名）数据帧的至少一个配对时间之间，则来自list2中的数据帧（具有全零的数据帧）的列。最终结果基本上是list1 score列中的0，区分不落入任何时间窗口的时间瞬间。

以下是一些示例数据：

set.seed(1)
list1 <- split(d<-replicate(10,1:100+rnorm(100,0,0.1)), 
         ceiling(seq_along(d)/100))
list1 <- lapply(list1, function(x) data.frame(cbind(time = x, score = 0)))
names(list1) <- letters[1:10]

list2 <- replicate(10, sample(1:100, sample(1:20), replace=FALSE)) 
list2 <- lapply(list2, function(x) 
         data.frame(cbind(x, x + sample(runif(100,min=0.1,max=3),length(x)))))
names(list2) <- letters[1:10]

我能够拼凑一个或两个适用于小例子的解决方案，但是当我在更大的列表上尝试它（即真实list1中的几百万个时刻）时，我遇到内存错误

首先，我创建了一个函数，在给定两个适当的数据框的情况下做我想要的事情

testfxn1 <-function(df1, df2)
  {
    df1<-lapply(1:dim(df2)[1], function(x)
    {
    df1[which(df1[1] > df2[x,1] & df1[1] < df2[x,2]), 2] <- 1
    return(df1)
    })
    return(cbind(df1[[1]][1], 
           score = rowSums(do.call(cbind,lapply(df1,'[[',2)))))
  }

然后，我使用sapply将函数应用于整个列表：

sapply(names(list1), simplify=FALSE, function(x) return(testfxn1(list1[[x]], list2[[x]])))

它做我想要的（即在数据不在相关时间窗口之间留下0），但在我的真实数据中有许多数据帧在list1中有250,000 - 750,000个时刻，我无法分配足够的内存来完成任务。

有关如何更有效地完成此任务的任何想法？当然，我lapply中的第一个testfxn1电话必须是问题的重要组成部分。

Answer 1

如果我正确理解问题，这可能会快一点。您至少可以通过矢量化删除一组循环。

windows = rbind(list2[[1]], list2[[2]])
# for each time, look at all windows and see if the time t falls outsize every single window:
list1[[1]]$score = sapply(list1[[1]]$time, FUN=function(t){
    all(t > windows[,1] | t < windows[,2])
}) 
# same thing for the second dataframe in list1
# TRUE = in a window, FALSE = not in a window. Use as.numeric() to coerce to 0's and 1's.

Answer 2

@HillarySanders的回答给了我一个想法。

就我的目的而言，此答案会产生与testfxn1相同的相关信息，但此处需要一个后处理步骤，以便将此处的输出与原始list1时间数据相结合。

testfxn2 <- function(df1, df2)
  {
  sapply(df1$time, function(g)
    {
    any(g > df2[,1] & g < df2[,2])
    })
  }

然后sapply：

sapply(names(list1), simplify=FALSE, function(x) testfxn2(list1[[x]], list2[[x]]))

Answer 3

可能有更好的方法，但是肯定很大一部分内存紧缩是df1的许多副本（每个副本都有score值的少数而不是零，在适当情况下）。另一种方法是使用df1的单个副本，每当时间匹配时我们递增相应的条目。此外，这种方法最终节省了用sums重新创建df1的需要换句话说：

# alternative to original tesetfxn1() function
#   The idea is to increment the value in the qualifying rows in-situ rather
#   than producing as many copies of df1 as there are rows in df2 and having to
#   sum these up at the end.
testfxn1 <-function(df1, df2)
{
    for (x in 1:nrow(df2))  # I find this more explicit that 1:dim(df2)[1]
    {
        # Get "list" of qualifying rows
        selectRows <- which(df1[1] > df2[x,1] & df1[1] < df2[x,2])
        # Increment the corresponding row's score value
        if (length(selectRows) > 0) {
          df1[selectRows, 2] <- df1[selectRows, 2] + 1
          # or more explicitly...
          # df1[selectRows, ]$score <- df1[selectRows, ]$score + 1
        }
    }

    df1
}

Answer 4

不确定这是否会更快。

res <- setNames(lapply(names(list1), function(x) {
         x1 <- list2[[x]]
         x2 <- list1[[x]][, 1]
         x3 <- t(replicate(length(x2), x1[, 1]))
         x4 <- t(replicate(length(x2), x1[, 2]))
            data.frame(time = x2, score = rowSums(x2 > x3 & x2 < x4))
      }), names(list1))

 s1 <- sapply(names(list1), simplify=FALSE, function(x) return(testfxn1(list1[[x]], list2[[x]])))     

 identical(res, s1)
  #[1] TRUE

另一种方式是：

 library(data.table)
 x1 <- rbindlist(list1)
 x2 <- rbindlist(list2)

 #slower
 s3 <- Vectorize(function(x) x1[, time] > x)(x2[, x]) & Vectorize(function(y) x1[, 
time] < y)(x2[, V2])
 indx <- rep(names(list2), sapply(list2, dim)[1, ])
 indx2 <- seq(1, nrow(x1), by = 100)
 lst1 <- split(seq_len(ncol(s3)), indx)
 res1 <- setNames(lapply(seq_along(indx2), function(i) data.frame(time = list1[[i]][, 
1], score = rowSums(s3[indx2[i]:(indx2[i] + 99), lst1[[i]]]))), names(list1))

 identical(res,res1)
 #[1] TRUE

R：使用一个列表有效地修改另一个列表

4 个答案: