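I have two tables. The first (T1) holds numeric ranges; the second (T2) holds coordinates and scores, where the coordinates are subdivisions of the ranges in T1. I want to compute the mean of the T2 score values that fall within each T1 range and add it to T1, putting NA where no coordinate is available for a range. For example: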
Table 1 (T1):
start end
1000 1100
1300 1390
1530 1610
1800 1905
Table 2 (T2):
coordinate score
1002 3
1004 1
1020 5
1087 4
1550 1
1559 7
1609 3
1805 2.5
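Result: the mean of the T2 scores that fall within each T1 range. For example, for 1000 to 1100 it is (3+1+5+4)/4 = 3.25, and since there are no scores between 1300 and 1390, the value there is NA, and so on.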
start end mean-score
1000 1100 3.25
1300 1390 NA
1530 1610 3.66
1800 1905 2.5
Could you help me implement this in R?
Thanks.
Answer 0 (score: 4)
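Following a hint from @akrun, I came across the foverlaps function in the "data.table" package. I'm not sure whether this is the best approach (but it works :-)). Note that the first call below returns only the ranges that contain at least one coordinate.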
library(data.table)
T1 <- as.data.table(T1)
T2 <- as.data.table(T2)
setkey(T1, start, end)
T2[, c("start", "end") := coordinate]
foverlaps(T2, T1)[, list(score = mean(score)), by = list(start, end)]
# start end score
# 1: 1000 1100 3.250000
# 2: 1530 1610 3.666667
# 3: 1800 1905 2.500000
Update
As @Arun mentioned in the comments, if you also set a key on T2 and swap the order of the arguments to foverlaps, you get the NA row as well.
setkey(T2, start, end)
foverlaps(T1, T2)[, list(mean = mean(score)), by = list(i.start, i.end)]
# i.start i.end mean
# 1: 1000 1100 3.250000
# 2: 1300 1390 NA
# 3: 1530 1610 3.666667
# 4: 1800 1905 2.500000
Answer 1 (score: 3)
One way is:
T1$mean_score <- sapply(seq_len(nrow(T1)), function(i) {
  x1 <- T1[i, ]
  # scores whose coordinate lies in (start, end]
  mean(T2$score[T2$coordinate > x1[, 1] & T2$coordinate <= x1[, 2]])
})
T1
# start end mean_score
#1 1000 1100 3.250000
#2 1300 1390 NaN
#3 1530 1610 3.666667
#4 1800 1905 2.500000
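Note that the empty range comes out as NaN rather than the NA the question asks for, because mean() of a zero-length vector is NaN. A minimal sketch of one way to convert it, assuming the sample data from the question (the in_range variable name is only for illustration):

```r
# Sample data from the question
T1 <- data.frame(start = c(1000L, 1300L, 1530L, 1800L),
                 end   = c(1100L, 1390L, 1610L, 1905L))
T2 <- data.frame(coordinate = c(1002L, 1004L, 1020L, 1087L,
                                1550L, 1559L, 1609L, 1805L),
                 score      = c(3, 1, 5, 4, 1, 7, 3, 2.5))

# Mean score per range, using the same (start, end] test as above
T1$mean_score <- sapply(seq_len(nrow(T1)), function(i) {
  in_range <- T2$coordinate > T1$start[i] & T2$coordinate <= T1$end[i]
  mean(T2$score[in_range])
})

# mean() of an empty vector is NaN; replace it with NA
T1$mean_score[is.nan(T1$mean_score)] <- NA
T1
```

After the replacement, is.na(T1$mean_score) is TRUE only for the 1300 to 1390 row.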
T1 <- structure(list(start = c(1000L, 1300L, 1530L, 1800L), end = c(1100L,
1390L, 1610L, 1905L)), .Names = c("start", "end"), class = "data.frame", row.names = c(NA,
-4L))
T2 <- structure(list(coordinate = c(1002L, 1004L, 1020L, 1087L, 1550L,
1559L, 1609L, 1805L), score = c(3, 1, 5, 4, 1, 7, 3, 2.5)), .Names = c("coordinate",
"score"), class = "data.frame", row.names = c(NA, -8L))
Answer 2 (score: 2)
Another possibility, using the dplyr functions rowwise, do, and between:
library(dplyr)
T1 %>%
rowwise() %>%
do(data.frame(., mean_score = mean(T2$score[between(T2$coordinate, left = .$start, right = .$end)])))
# start end mean_score
# 1 1000 1100 3.250000
# 2 1300 1390 NaN
# 3 1530 1610 3.666667
# 4 1800 1905 2.500000
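One detail worth keeping in mind: between() includes both endpoints (x >= left & x <= right), while the sapply approach in Answer 1 excludes the start. On the sample data this makes no difference, since no coordinate sits exactly on a range boundary. A small base R sketch of where the two tests would diverge, using made-up coordinates that are not part of the question's data:

```r
# Hypothetical coordinates, chosen to sit on the boundaries of one range
coordinate <- c(1000L, 1002L, 1100L)
start <- 1000L
end   <- 1100L

# Closed interval [start, end]: the test that between() performs
closed <- coordinate >= start & coordinate <= end

# Half-open interval (start, end]: the test used in Answer 1
half_open <- coordinate > start & coordinate <= end

closed
half_open
```

If ranges in T1 can share a boundary value, the choice between the two tests decides whether a coordinate on that boundary is counted in one range or in both.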