Apply-family functions over two lists (to avoid nested for loops)

Asked: 2016-12-01 09:32:16

Tags: r

Suppose I have the following:

myseq <- seq(0, 1, by = 0.1)
scores <- sample(seq(0, 1, by = 0.01), 10)
var1 <- sample(c(0, 1), 10, replace = TRUE)
var2 <- sample(c(0, 1), 10, replace = TRUE)
mydf <- data.frame(scores = scores, var1 = var1, var2 = var2)

myseq
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

mydf
  scores var1 var2
1   0.10    1    0
2   0.06    1    0
3   0.74    0    0
4   0.15    1    0
5   0.40    1    1
6   0.96    0    0
7   0.04    1    0
8   0.71    0    1
9   0.94    1    1
10  0.38    0    0

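For each value in myseq, I want to sum var1 and var2 over the subset of records whose scores value is greater than that value of myseq.

I want to use only apply-family functions (apply, lapply, tapply, sapply, mapply, etc.). In other words, no nested for loops.

So, for example:

The first value in myseq is 0.0. All scores are greater than 0.0, so I want to return var1 = 6 and var2 = 3.

The second value in myseq is 0.1. Only 7 of the 10 scores are greater than 0.1, so I want to return var1 = 3 and var2 = 3.

... and so on ...

In the end, I would like the final output to be an 11 (rows) x 2 (columns) matrix (or data frame or list) containing the sum of each var.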

var1 var2
   6    3
   3    3
   ...
   ...

Note: there are 11 rows because myseq has length 11, and 2 columns because there are two variables, var1 and var2.

5 Answers:

Answer 0: (score: 2)

Something like this?

# Subset on mydf$scores (not a global `scores` vector) and sum columns 2:3
res <- t(sapply(myseq, function(x) colSums(mydf[mydf$scores > x, 2:3])))
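
As a quick sanity check, here is a minimal sketch that hard-codes the data frame printed in the question (so no random sampling is involved) and applies the same sapply idea, with colSums producing the column totals. Under that assumption, the first two rows can be compared with the 6/3 and 3/3 values worked out in the question.

```r
# Data copied from the mydf printout in the question (not re-sampled)
myseq <- seq(0, 1, by = 0.1)
mydf <- data.frame(
  scores = c(0.10, 0.06, 0.74, 0.15, 0.40, 0.96, 0.04, 0.71, 0.94, 0.38),
  var1   = c(1, 1, 0, 1, 1, 0, 1, 0, 1, 0),
  var2   = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 0)
)

# For each threshold, keep rows with scores strictly above it and sum var1, var2
res <- t(sapply(myseq, function(x) colSums(mydf[mydf$scores > x, 2:3])))

dim(res)    # one row per threshold, one column per variable
res[1:2, ]  # thresholds 0.0 and 0.1
```

Because two columns are always selected, the subset stays a data frame even when no rows match, so colSums returns zeros for the last threshold (1.0).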

Answer 1: (score: 2)

One idea,

t(sapply(lapply(myseq, function(i) mydf[mydf$scores >= i, -1]), colSums))
 #       var1 var2
 #[1,]    6    7
 #[2,]    6    7
 #[3,]    6    7
 #[4,]    6    6
 #[5,]    3    4
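
Note that this snippet compares with >=, while the question asks for scores strictly greater than each threshold. The two differ whenever a score equals a value in myseq. A small sketch on the data printed in the question (hard-coded here; the helper name sums_above is hypothetical and not part of the answer) shows the difference at threshold 0.1, where one record has a score of exactly 0.10:

```r
# Data copied from the mydf printout in the question (not re-sampled)
myseq <- seq(0, 1, by = 0.1)
mydf <- data.frame(
  scores = c(0.10, 0.06, 0.74, 0.15, 0.40, 0.96, 0.04, 0.71, 0.94, 0.38),
  var1   = c(1, 1, 0, 1, 1, 0, 1, 0, 1, 0),
  var2   = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 0)
)

# Hypothetical helper: sums of var1/var2 over rows selected by comparison `cmp`
sums_above <- function(cmp) {
  t(vapply(myseq,
           function(i) colSums(mydf[cmp(mydf$scores, i), -1]),
           numeric(2)))
}

strict    <- sums_above(`>`)   # scores >  threshold, as asked in the question
inclusive <- sums_above(`>=`)  # scores >= threshold, as in the answer

strict[2, ]     # threshold 0.1 excludes the record with score 0.10
inclusive[2, ]  # threshold 0.1 includes it
```

vapply is used here only so that the result is guaranteed to be a numeric matrix; sapply would work the same way on this data.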

Answer 2: (score: 2)

A tidyverse solution:

myseq <- seq(0, 1, by = 0.1)
scores <- sample(seq(0, 1, by = 0.01), 10)
var1 <- sample(c(0, 1), 10, replace = TRUE)
var2 <- sample(c(0, 1), 10, replace = TRUE)
mydf <- data.frame(scores = scores, var1 = var1, var2 = var2)

myseq
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

mydf
##    scores var1 var2
## 1    0.85    0    0
## 2    0.06    1    0
## 3    0.23    1    1
## 4    0.98    1    1
## 5    0.32    0    1
## 6    0.58    0    0
## 7    0.45    0    0
## 8    0.90    1    1
## 9    0.22    1    1
## 10   0.15    0    0

library(purrr)
library(dplyr)

# summarise_each() and funs() are deprecated in current dplyr; across() replaces them
map_df(myseq, ~ filter(mydf, scores > .x) %>% summarise(across(-scores, sum)))
##    var1 var2
## 1     5    5
## 2     4    5
## 3     4    5
## 4     2    3
## 5     2    2
## 6     2    2
## 7     2    2
## 8     2    2
## 9     2    2
## 10    1    1
## 11    0    0

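Answer 3: (score: 1)

Another alternative that avoids redundant computation:

Order the scores and, for each element of myseq, find the index of the last sorted score that does not exceed it: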

o = order(mydf$scores)

i = findInterval(myseq, mydf$scores[o])
z = rep_len(0L, sum(!i)) #zeroes to add, later on, because x[0] results in 0-length 

Compute the cumulative sums only once:

csv1 = cumsum(mydf$var1[o]) 
csv2 = cumsum(mydf$var2[o])

Subset the cumulative sums appropriately (I generated the data using set.seed(1821)):

csv1[length(csv1)] - c(z, csv1[i])
# [1] 8 7 6 6 6 5 3 3 2 1 0
csv2[length(csv2)] - c(z, csv2[i])
# [1] 6 5 5 5 5 3 2 2 1 1 0

Since you mentioned more than 2 variables, the last operations can be replaced with:

sapply(mydf[-1], function(x) { cs = cumsum(x[o]); cs[length(cs)] - c(z, cs[i]) })
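
To see that the cumulative-sum shortcut gives the same totals as direct subsetting, here is a minimal sketch on the data printed in the question (hard-coded rather than sampled, so it does not reproduce the set.seed(1821) numbers above). It assumes myseq is sorted in increasing order, so that any zero indices returned by findInterval come first and line up with the padding in z.

```r
# Data copied from the mydf printout in the question (not re-sampled)
myseq <- seq(0, 1, by = 0.1)
mydf <- data.frame(
  scores = c(0.10, 0.06, 0.74, 0.15, 0.40, 0.96, 0.04, 0.71, 0.94, 0.38),
  var1   = c(1, 1, 0, 1, 1, 0, 1, 0, 1, 0),
  var2   = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 0)
)

o <- order(mydf$scores)
i <- findInterval(myseq, mydf$scores[o])  # number of scores <= each threshold
z <- rep_len(0, sum(i == 0))              # thresholds below every score subtract nothing

# Cumulative-sum approach: total minus the part at or below each threshold
fast <- sapply(mydf[-1], function(x) {
  cs <- cumsum(x[o])
  cs[length(cs)] - c(z, cs[i])
})

# Direct approach: subset and sum once per threshold
direct <- t(sapply(myseq, function(x) colSums(mydf[mydf$scores > x, -1])))

all.equal(fast, direct, check.attributes = FALSE)
```

The shortcut sorts once and takes one cumulative sum per variable, instead of re-subsetting the data frame for every threshold.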

Answer 4: (score: 0)

You could try data.table:

library(data.table)
set.seed(5)
myseq <- seq(0, 1, by = 0.1)
scores <- sample(seq(0, 1, by = 0.01), 10)
var1 <- sample(c(0, 1), 10, replace = TRUE)
var2 <- sample(c(0, 1), 10, replace = TRUE)
mydf <- data.frame(scores = scores, var1 = var1, var2 = var2)

setDT(mydf)
result <- t(sapply(myseq, function(x){ mydf[scores > x, lapply(.SD[, -1, with = FALSE], sum)]}))

> result
      var1      var2     
 [1,] 4         4        
 [2,] 4         4        
 [3,] 4         3        
 [4,] 3         3        
 [5,] 3         3        
 [6,] 3         3        
 [7,] 3         3        
 [8,] 3         2        
 [9,] 2         1        
[10,] 1         1        
[11,] Numeric,0 Numeric,0