Compare two tables, average the elements that exist, and keep NA for the missing ones, in R

Time: 2014-10-21 09:31:58

Tags: r compare average

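I have two tables. The first (T1) holds numeric ranges. The second (T2) holds coordinates and scores, where the coordinates are subdivisions of the ranges in T1.

I want to compute the mean of the T2 scores within each T1 range and add it to T1, putting NA where no coordinate falls inside the range. For example:

Table 1 (T1):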

    start    end    
    1000    1100
    1300    1390
    1530    1610
    1800    1905

Table 2 (T2):

    coordinate  score
    1002        3
    1004        1
    1020        5
    1087        4
    1550        1
    1559        7
    1609        3
    1805        2.5

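Result: the T2 scores averaged within each T1 range. For example, for 1000 to 1100 the mean is (3+1+5+4)/4 = 3.25, while no score falls between 1300 and 1390, so the value there is NA, and so on.

    start    end    mean-score
    1000    1100    3.25
    1300    1390    NA
    1530    1610    3.67
    1800    1905    2.5

Could you help me implement this in R?

Thanks.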

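3 answers:

Answer 0 (score: 4)

Following a hint from @akrun, I came across the foverlaps function in the "data.table" package. I am not sure whether this is the best approach, but it works :-)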

library(data.table)
T1 <- as.data.table(T1)
T2 <- as.data.table(T2)
setkey(T1, start, end)
# Represent each coordinate as a zero-width interval [coordinate, coordinate]
T2[, c("start", "end") := coordinate]
foverlaps(T2, T1)[, list(score = mean(score)), by = list(start, end)]
#    start  end    score
# 1:  1000 1100 3.250000
# 2:  1530 1610 3.666667
# 3:  1800 1905 2.500000

Update

As @Arun mentioned in the comments, if you also set a key on T2 and swap the order of the arguments to foverlaps, you get the NA as well:

setkey(T2, start, end)
foverlaps(T1, T2)[, list(mean = mean(score)), by = list(i.start, i.end)]
#    i.start i.end     mean
# 1:    1000  1100 3.250000
# 2:    1300  1390       NA
# 3:    1530  1610 3.666667
# 4:    1800  1905 2.500000
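
To make the role of the argument order explicit, here is a minimal self-contained sketch of the same idea. It assumes the sample data from the question, and the names `ov`, `res` and `mean_score` are only illustrative. With T1 as `x` and the keyed T2 as `y`, foverlaps keeps every T1 range (its default is `nomatch = NA`), so a range without any coordinate yields a row whose score is NA, and the mean of that row is NA.

```r
library(data.table)

# Sample data copied from the question
T1 <- data.table(start = c(1000L, 1300L, 1530L, 1800L),
                 end   = c(1100L, 1390L, 1610L, 1905L))
T2 <- data.table(coordinate = c(1002L, 1004L, 1020L, 1087L,
                                1550L, 1559L, 1609L, 1805L),
                 score      = c(3, 1, 5, 4, 1, 7, 3, 2.5))

# Treat each coordinate as a zero-width interval and key T2 on it
T2[, `:=`(start = coordinate, end = coordinate)]
setkey(T2, start, end)

# x = T1, y = T2: every T1 range is kept; unmatched ranges get score = NA
ov <- foverlaps(T1, T2)

# T1's own columns come back prefixed with "i."; rename them while grouping
res <- ov[, list(mean_score = mean(score)),
          by = list(start = i.start, end = i.end)]
res
```

Grouping by `i.start` and `i.end` rather than `start` and `end` matters here, because the unprefixed columns in the overlap result belong to T2.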

Answer 1 (score: 3)

One way is:

T1$mean_score <- sapply(seq_len(nrow(T1)), function(i) {
  x1 <- T1[i, ]
  # Average the scores whose coordinate lies in [start, end], both ends inclusive
  mean(T2$score[T2$coordinate >= x1[, 1] & T2$coordinate <= x1[, 2]])
})

T1
#   start  end mean_score
# 1  1000 1100   3.250000
# 2  1300 1390        NaN
# 3  1530 1610   3.666667
# 4  1800 1905   2.500000
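
Note that the empty range comes back as NaN, because `mean()` of a zero-length numeric vector is NaN, whereas the question asks for NA. A possible variation is sketched below. It assumes the sample data from the question, and `mean_or_na` is a hypothetical helper introduced only for this example.

```r
# Sample data copied from the question
T1 <- data.frame(start = c(1000L, 1300L, 1530L, 1800L),
                 end   = c(1100L, 1390L, 1610L, 1905L))
T2 <- data.frame(coordinate = c(1002L, 1004L, 1020L, 1087L,
                                1550L, 1559L, 1609L, 1805L),
                 score      = c(3, 1, 5, 4, 1, 7, 3, 2.5))

# Hypothetical helper: return NA instead of NaN when there is nothing to average
mean_or_na <- function(x) if (length(x) == 0L) NA_real_ else mean(x)

# For each (start, end) pair, average the scores with start <= coordinate <= end
T1$mean_score <- mapply(
  function(lo, hi) mean_or_na(T2$score[T2$coordinate >= lo & T2$coordinate <= hi]),
  T1$start, T1$end
)
T1
```

An alternative with the same effect would be to keep the original `sapply` call and afterwards replace the entries for which `is.nan()` is TRUE with NA.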

Data

T1 <- structure(list(start = c(1000L, 1300L, 1530L, 1800L), end = c(1100L, 
 1390L, 1610L, 1905L)), .Names = c("start", "end"), class = "data.frame", row.names = c(NA, 
 -4L))


T2 <-  structure(list(coordinate = c(1002L, 1004L, 1020L, 1087L, 1550L, 
 1559L, 1609L, 1805L), score = c(3, 1, 5, 4, 1, 7, 3, 2.5)), .Names = c("coordinate", 
 "score"), class = "data.frame", row.names = c(NA, -8L))

Answer 2 (score: 2)

A possibility using the dplyr functions rowwise, do and between.

library(dplyr)

T1 %>%
  rowwise() %>%
  do(data.frame(., mean_score = mean(T2$score[between(T2$coordinate, left = .$start, right = .$end)])))
#   start  end mean_score
# 1  1000 1100   3.250000
# 2  1300 1390        NaN
# 3  1530 1610   3.666667
# 4  1800 1905   2.500000
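
In more recent dplyr versions (1.0.0 and later, as far as I know) `do()` is marked as superseded, and a row-wise `mutate()` can express the same thing. The following is only a sketch under that assumption, using the sample data from the question. It also converts NaN to NA, since the question asks for NA. The name `res` is illustrative.

```r
library(dplyr)

# Sample data copied from the question
T1 <- data.frame(start = c(1000L, 1300L, 1530L, 1800L),
                 end   = c(1100L, 1390L, 1610L, 1905L))
T2 <- data.frame(coordinate = c(1002L, 1004L, 1020L, 1087L,
                                1550L, 1559L, 1609L, 1805L),
                 score      = c(3, 1, 5, 4, 1, 7, 3, 2.5))

res <- T1 %>%
  rowwise() %>%
  # Inside a row-wise mutate, start and end are the scalar bounds of one range
  mutate(mean_score = mean(T2$score[between(T2$coordinate, start, end)])) %>%
  ungroup() %>%
  # mean() of an empty selection is NaN; turn that into NA
  mutate(mean_score = if_else(is.nan(mean_score), NA_real_, mean_score))
res
```

Like the `do()` version, this scans T2 once per row of T1, so for large tables an overlap join such as foverlaps is likely the better fit.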