This post is similar to this one, but uses a different approach. I have two data frames, X and Y, shown here:
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
Start = c(0, 540, 920, 0, 582, 715 ),
Stop = c(230, 720, 1270, 350, 635, 950))
Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
Start = c(3, 16, 180,
15, 585, 800 ),
Stop = c(15, 24, 201,
102, 612, 850),
Dif = c(12, 8, 21,
87, 27, 50))
I want to obtain Z, namely:
Z <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
Start = c(0, 540, 920, 0, 582, 715 ),
Stop = c(230, 720, 1270, 350, 635, 950),
Count = c(3, 0, 0, 1, 1, 1),
Mean = c(13.66, 0, 0, 87, 27, 50))
where:
V1 = X$V1
Start = X$Start
Stop = X$Stop
Count = the number of rows of Y that fall within the Start/Stop coordinate range of X, which I obtain with:
library(tidyverse)
X %>%
mutate(Count = pmap_int(list(V1, Start, Stop), ~filter(Y, V1 == ..1, Start >= ..2, Stop <=..3) %>% nrow))
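Mean = the mean of Dif over the rows of Y whose Start and Stop fall within that range (in the first case it is (12 + 8 + 21)/3 = 13.66, because those three rows of Y are the ones that fall within the first range of X).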
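I don't know how to obtain the Mean column: when I try an approach similar to the one I used for Count, I can't work out how to use mean() without getting an error.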
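As a quick check of the arithmetic for the first range, the three Dif values can be averaged directly. This is only a sketch of the expected number; the variable names are made up for illustration.

```r
# Dif values of the three Y rows that fall inside the first X range (chr1, 0-230)
first_difs <- c(12, 8, 21)

# The whole sum is divided by the count: (12 + 8 + 21) / 3
first_mean <- mean(first_difs)
first_mean
```

The value is 41/3, which the post writes truncated as 13.66.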
Answer 0: (score: 1)
Here is my solution.
require("sqldf")
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
Start = c(0, 540, 920, 0, 582, 715 ),
Stop = c(230, 720, 1270, 350, 635, 950))
Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
Start = c(3, 16, 180,
15, 585, 800 ),
Stop = c(15, 24, 201,
102, 612, 850),
Dif = c(12, 8, 21,
87, 27, 50))
Z <- sqldf("select a.*
-- ,b.Start as Y_Start
-- ,b.Stop as Y_Stop
-- ,b.Dif
,sum(case when b.Start is not null then 1 else 0 end) as Count
,avg(coalesce(b.Dif,0)) as Mean
from X as a
left join Y as b
on a.V1 = b.V1
and a.Start < b.Start
and a.Stop > b.Stop
group by a.V1, a.Start, a.Stop
")
Here is the output:
> Z
V1 Start Stop Count Mean
1 chr1 0 230 3 13.66667
2 chr1 540 720 0 0.00000
3 chr1 920 1270 0 0.00000
4 chr2 0 350 1 87.00000
5 chr2 582 635 1 27.00000
6 chr2 715 950 1 50.00000
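One detail to keep in mind: the join condition uses strict inequalities (a.Start < b.Start and a.Stop > b.Stop), whereas the filter in the question uses >= and <=. With the sample data both give the same result, but they would differ for a Y interval that touches an edge of an X range. A small base R sketch with made-up boundary rows (y_edge is hypothetical, not part of the original data):

```r
# Hypothetical boundary case: both Y intervals touch an edge of the X range [0, 230]
x_start <- 0
x_stop  <- 230
y_edge  <- data.frame(Start = c(0, 180), Stop = c(15, 230))

# Strict comparison, as in the join condition of the query
strict <- x_start < y_edge$Start & x_stop > y_edge$Stop

# Inclusive comparison, as in the question's filter (Start >= ..2, Stop <= ..3)
inclusive <- y_edge$Start >= x_start & y_edge$Stop <= x_stop

sum(strict)     # rows counted with strict bounds
sum(inclusive)  # rows counted with inclusive bounds
```

If intervals that share an endpoint with the range should be counted, switching the query to <= and >= would presumably match the question's definition more closely.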
Answer 1: (score: 1)
Consider base R's merge:
# MERGE X AND Y AND CALCULATE Count AND Dif SUBSET
mdf <- within(merge(X, Y, by="V1", suffixes=c("", "_"), sort=FALSE), {
Count <- as.integer(Start <= Start_ & Stop_ <= Stop)
Dif_sub <- ifelse(Start <= Start_ & Stop_ <= Stop, Dif, NA)
})
# MERGE (LEFT JOIN) AGGREGATIONS OF Count AND Mean
aggdf <- merge(aggregate(Count ~ V1 + Start + Stop, mdf, sum),
aggregate(cbind(Mean=Dif_sub) ~ V1 + Start + Stop, mdf, mean),
by=c("V1", "Start", "Stop"), all.x=TRUE)
aggdf
# V1 Start Stop Count Mean
# 1 chr1 0 230 3 13.66667
# 2 chr1 540 720 0 NA
# 3 chr1 920 1270 0 NA
# 4 chr2 0 350 1 87.00000
# 5 chr2 582 635 1 27.00000
# 6 chr2 715 950 1 50.00000
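For comparison, the per-row filtering idea from the question can also be sketched in plain base R, without a join. This is a minimal sketch rather than a definitive implementation: it assumes inclusive bounds as in the question, and it assumes that ranges with no matching rows should get a Mean of 0 (use NA_real_ instead to mirror the output above). The name hits is introduced here for illustration only.

```r
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
                Start = c(0, 540, 920, 0, 582, 715),
                Stop = c(230, 720, 1270, 350, 635, 950))
Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
                Start = c(3, 16, 180, 15, 585, 800),
                Stop = c(15, 24, 201, 102, 612, 850),
                Dif = c(12, 8, 21, 87, 27, 50))

# For each row of X, collect the Dif values of the Y rows that lie fully inside its range
hits <- lapply(seq_len(nrow(X)), function(i) {
  Y$Dif[Y$V1 == X$V1[i] & Y$Start >= X$Start[i] & Y$Stop <= X$Stop[i]]
})

Z <- data.frame(X,
                Count = lengths(hits),
                # Assumption: an empty range gets 0 rather than NaN or NA
                Mean = vapply(hits, function(d) if (length(d) > 0) mean(d) else 0,
                              numeric(1)))
Z
```

Collecting the matching Dif values once and deriving both Count and Mean from them avoids filtering Y twice per row.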