R: Count occurrences in one data frame based on another data frame, and calculate the mean within ranges

Date: 2019-10-30 16:46:03

Tags: r dataframe merge

This post is similar to this one, but takes a different approach. I have two data frames, X and Y, shown here:

X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
                Start = c(0, 540, 920, 0, 582, 715 ),
                Stop = c(230, 720, 1270, 350, 635, 950))

Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
                Start = c(3, 16, 180,
                          15, 585, 800 ),
                Stop = c(15, 24, 201,
                         102, 612, 850),
                Dif = c(12, 8, 21,
                        87, 27, 50))

I would like to obtain Z, which is:

Z <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
                Start = c(0, 540, 920, 0, 582, 715 ),
                Stop = c(230, 720, 1270, 350, 635, 950),
                Count = c(3, 0, 0, 1, 1, 1),
                Mean = c(13.66, 0, 0, 87, 27, 50))

where:

V1 = X$V1
Start = X$Start
Stop = X$Stop

Count = the number of Y rows that fall within the Start/Stop coordinate range of each X row, which I obtained with:

    library(tidyverse)
    X %>%
    mutate(Count = pmap_int(list(V1, Start, Stop), ~filter(Y, V1 == ..1,  Start >= ..2, Stop <=..3) %>% nrow))

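Mean = the mean of Dif over the Y rows whose Start and Stop fall within that range (in the first case it is (12 + 8 + 21) / 3 ≈ 13.67, shown as 13.66 in Z, because those three Y rows are the occurrences that fall within the first range of X).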
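As a quick check of the first-range value, the mean can be computed directly in R. The parentheses around the sum matter, because division binds more tightly than addition. The variable names here are illustrative only.

```r
# Dif values of the three Y rows inside the first X range (chr1, 0 to 230)
dif_first_range <- c(12, 8, 21)

# Sum first, then divide by the number of rows
mean_first <- sum(dif_first_range) / length(dif_first_range)
print(mean_first)        # 13.66667

# Without parentheses only 21 is divided by 3
unparenthesised <- 12 + 8 + 21 / 3
print(unparenthesised)   # 27
```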

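I don't know how to obtain the Mean column: when I try an approach similar to the one I used for Count, I can't work out how to use mean() without getting an error.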
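One possible way to get Mean alongside Count, sketched here in base R rather than with pmap_int, is to collect the matching Dif values for each X row first and then take their length and their mean. The names hits and mean_or_zero are introduced for illustration only. The fallback to 0 for ranges with no match is an assumption taken from the desired Z above; mean() of an empty vector would otherwise return NaN.

```r
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
                Start = c(0, 540, 920, 0, 582, 715),
                Stop = c(230, 720, 1270, 350, 635, 950),
                stringsAsFactors = FALSE)

Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
                Start = c(3, 16, 180, 15, 585, 800),
                Stop = c(15, 24, 201, 102, 612, 850),
                Dif = c(12, 8, 21, 87, 27, 50),
                stringsAsFactors = FALSE)

# For each X row, the Dif values of the Y rows lying inside its range
# (same chromosome, inclusive bounds as in the question's filter)
hits <- lapply(seq_len(nrow(X)), function(i) {
  Y$Dif[Y$V1 == X$V1[i] & Y$Start >= X$Start[i] & Y$Stop <= X$Stop[i]]
})

# Hypothetical helper: mean with a fallback of 0 when nothing matched
mean_or_zero <- function(d) if (length(d) > 0) mean(d) else 0

Z <- X
Z$Count <- lengths(hits)                            # number of matching Y rows
Z$Mean <- vapply(hits, mean_or_zero, numeric(1))    # mean Dif, or 0 if none
Z
```

Replacing the 0 in mean_or_zero with NA_real_ would instead mark ranges without any match as missing.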

2 Answers:

Answer 0: (score: 1)

Here is my solution.

require("sqldf")

X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
                Start = c(0, 540, 920, 0, 582, 715 ),
                Stop = c(230, 720, 1270, 350, 635, 950))

Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
                Start = c(3, 16, 180,
                          15, 585, 800 ),
                Stop = c(15, 24, 201,
                         102, 612, 850),
                Dif = c(12, 8, 21,
                        87, 27, 50))


Z <- sqldf("select a.*
                  -- ,b.Start as Y_Start
                  -- ,b.Stop as Y_Stop
                  -- ,b.Dif
                  ,sum(case when b.Start is not null then 1 else 0 end) as Count
                  ,avg(coalesce(b.Dif,0)) as Mean
           from X as a
           left join Y as b
           on a.V1 = b.V1
           and b.Start >= a.Start
           and b.Stop <= a.Stop
           group by a.V1, a.Start, a.Stop
           ")

Here is the output:

> Z
    V1 Start Stop Count     Mean
1 chr1     0  230     3 13.66667
2 chr1   540  720     0  0.00000
3 chr1   920 1270     0  0.00000
4 chr2     0  350     1 87.00000
5 chr2   582  635     1 27.00000
6 chr2   715  950     1 50.00000
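A note on the range test: the question's filter uses inclusive bounds (Start >= ..2, Stop <= ..3). In the sample data no Y interval touches an X boundary, so strict and inclusive comparisons give the same counts. They differ only for an interval that starts or ends exactly on a boundary, as in this small sketch with hypothetical coordinates.

```r
# Hypothetical coordinates: the Y interval starts exactly at the X range's Start
x_start <- 0;  x_stop <- 230
y_start <- 0;  y_stop <- 15

# Inclusive bounds count the interval as inside the range
inclusive <- y_start >= x_start & y_stop <= x_stop

# Strict bounds exclude it, because y_start equals x_start
strict <- y_start > x_start & y_stop < x_stop

print(c(inclusive = inclusive, strict = strict))
```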

Answer 1: (score: 1)

Consider base R's merge:

# MERGE X AND Y AND CALCULATE Count AND Dif SUBSET
mdf <- within(merge(X, Y, by="V1", suffixes=c("", "_"), sort=FALSE), {    
         Count <- as.integer(Start <= Start_ & Stop_ <= Stop) 
         Dif_sub <- ifelse(Start <= Start_ & Stop_ <= Stop, Dif, NA)
    })

# MERGE (LEFT JOIN) AGGREGATIONS OF Count AND Mean
aggdf <-  merge(aggregate(Count ~ V1 + Start + Stop, mdf, sum),
                aggregate(cbind(Mean=Dif_sub) ~ V1 + Start + Stop, mdf, mean),
                by=c("V1", "Start", "Stop"), all.x=TRUE)
aggdf
#     V1 Start Stop Count     Mean
# 1 chr1     0  230     3 13.66667
# 2 chr1   540  720     0       NA
# 3 chr1   920 1270     0       NA
# 4 chr2     0  350     1 87.00000
# 5 chr2   582  635     1 27.00000
# 6 chr2   715  950     1 50.00000
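The Mean column is NA for ranges without any match because the formula interface of aggregate drops rows with NA values by default, so those groups are missing from the second aggregation and the left join fills them with NA. If 0 is preferred, as in the desired Z from the question, the NA values can be replaced afterwards. The following is a sketch of that extra step; the intermediate column inside is introduced here for readability.

```r
X <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
                Start = c(0, 540, 920, 0, 582, 715),
                Stop = c(230, 720, 1270, 350, 635, 950),
                stringsAsFactors = FALSE)

Y <- data.frame(V1 = c("chr1", "chr1", "chr1", "chr2", "chr2", "chr2"),
                Start = c(3, 16, 180, 15, 585, 800),
                Stop = c(15, 24, 201, 102, 612, 850),
                Dif = c(12, 8, 21, 87, 27, 50),
                stringsAsFactors = FALSE)

# Pair every X row with every Y row on the same chromosome and flag matches
mdf <- within(merge(X, Y, by = "V1", suffixes = c("", "_"), sort = FALSE), {
  inside  <- Start <= Start_ & Stop_ <= Stop
  Count   <- as.integer(inside)
  Dif_sub <- ifelse(inside, Dif, NA)
})

# Left join of the two aggregations, as in the answer above
aggdf <- merge(aggregate(Count ~ V1 + Start + Stop, mdf, sum),
               aggregate(cbind(Mean = Dif_sub) ~ V1 + Start + Stop, mdf, mean),
               by = c("V1", "Start", "Stop"), all.x = TRUE)

# Extra step: ranges without matches get a Mean of 0 instead of NA
aggdf$Mean[is.na(aggdf$Mean)] <- 0
aggdf
```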
