Calculate the sum of values overlapping with another data frame in R

Time: 2018-11-27 17:39:43

Tags: r

I have the following two data frames:

depth
chr  Pos Nucleotide Coverage
chr1 1   A          10
chr1 2   G          12
chr1 3   T          3
chr1 4   A          20
chr1 5   T          22
chr1 6   N          0
chr1 7   N          0
chr2 23  A          1
chr2 24  T          5
chr2 25  G          15

and another data frame of intervals:

intervals

chr1  3  5
chr2 23 25
chr4  1 30

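My desired output is shown below. If a position in the depth data frame falls within a range given in the intervals data frame and has the same chr value, compute the sum of Coverage over all nucleotides in that range and put it in a fourth column.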

chr1  3  5 45
chr2 23 25 21
chr4  1 30  0

How can I produce this output in R? The depth data comes from very large files, about 50 GB in size.

2 answers:

Answer 0: (score: 1)

You can use sqldf:

library(sqldf)

out1 <- sqldf('
select    i.*
          , coalesce(sum(d.Coverage), 0) as CovSum
from      intervals i
          left join depth d
            on  d.Pos between i.low and i.high
                and d.chr = i.chr
group by  i.chr, i.low, i.high
')
out1        
#    chr low high CovSum
# 1 chr1   3    5     45
# 2 chr2  23   25     21
# 3 chr4   1   30      0

out2 <- sqldf('
select    d.*
from      intervals i
          join depth d
            on  d.Pos between i.low and i.high
                and d.chr = i.chr
')
out2
#    chr Pos Nucleotide Coverage
# 1 chr1   3          T        3
# 2 chr1   4          A       20
# 3 chr1   5          T       22
# 4 chr2  23          A        1
# 5 chr2  24          T        5
# 6 chr2  25          G       15

Data used:

library(data.table)

depth <- fread('
chr  Pos Nucleotide Coverage
chr1 1   A          10
chr1 2   G          12
chr1 3   T          3
chr1 4   A          20
chr1 5   T          22
chr1 6   N          0
chr1 7   N          0
chr2 23  A          1
chr2 24  T          5
chr2 25  G          15
')

intervals <- fread('
chr   low high
chr1  3  5
chr2 23 25
chr4  1 30
')
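
Since the data above is already loaded with data.table, the same per-interval sum can also be sketched as a data.table non-equi join, which might be worth trying on larger inputs (it has not been tested at the 50 GB scale). This is not part of the original answer: the tables are rebuilt inline so the block is self-contained, and `cov_sum` and `out` are illustrative names.

```r
library(data.table)

# Same example data as above, built inline so the block runs on its own
depth <- data.table(
  chr        = c(rep("chr1", 7), rep("chr2", 3)),
  Pos        = c(1:7, 23:25),
  Nucleotide = c("A", "G", "T", "A", "T", "N", "N", "A", "T", "G"),
  Coverage   = c(10L, 12L, 3L, 20L, 22L, 0L, 0L, 1L, 5L, 15L)
)
intervals <- data.table(
  chr  = c("chr1", "chr2", "chr4"),
  low  = c(3L, 23L, 1L),
  high = c(5L, 25L, 30L)
)

# Non-equi join: for each row of intervals (by = .EACHI), collect the depth
# rows with the same chr and low <= Pos <= high, then sum their Coverage.
# Intervals without any match see Coverage as NA, so na.rm = TRUE yields 0.
cov_sum <- depth[intervals,
                 on = .(chr, Pos >= low, Pos <= high),
                 .(CovSum = sum(Coverage, na.rm = TRUE)),
                 by = .EACHI]$CovSum

# Attach the sums to a copy of intervals (rows come back in intervals order)
out <- copy(intervals)[, CovSum := cov_sum]
out
```

On the example data this should give the same three rows as `out1`, with sums of 45, 21 and 0.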

Answer 1: (score: 0)

dplyr works well for this kind of operation:

# first, read in the data, with headers
depth <- read.table(header = TRUE, text = 
"chr  Pos Nucleotide Coverage
chr1 1   A          10
chr1 2   G          12
chr1 3   T          3
chr1 4   A          20
chr1 5   T          22
chr1 6   N          0
chr1 7   N          0
chr2 23  A          1
chr2 24  T          5
chr2 25  G          15")

intervals <- read.table(header = TRUE, text =
"chr  start   end
chr1  3  5
chr2 23 25
chr4  1 30")

Now you can get to work:

library(dplyr)
# create a new data.frame:
# link intervals with any rows from depth where the value of 'chr' matches
# (keeping all rows from intervals)

merged <-
  merge(intervals, depth, by = 'chr', all.x = TRUE) %>%

  mutate(
    # add a column to flag rows in the range spec'd by intervals
    in_range = Pos >= start & Pos <= end,
    # substitute 0 for any missing values in Coverage
    Coverage = coalesce(Coverage, 0L))

# now you can get your results:

result1 <- 
  merged %>% 
  group_by(chr, start, end) %>%
  # sum Coverage over in-range rows only; in_range is NA for intervals with no
  # depth rows, and %in% treats NA as FALSE, so every interval is kept (with 0)
  summarise(sum_cov = sum(Coverage[in_range %in% TRUE]))

result2 <-
  merged %>%
  # keep those in range (rows where in_range is NA are dropped as well)
  filter(in_range) %>%
  # only get the columns that were in depth
  select(all_of(names(depth)))

The results match what you expected:

> result1
  chr   start   end sum_cov
1 chr1      3     5      45
2 chr2     23    25      21
3 chr4      1    30       0

> result2
   chr Pos Nucleotide Coverage
1 chr1   3          T        3
2 chr1   4          A       20
3 chr1   5          T       22
4 chr2  23          A        1
5 chr2  24          T        5
6 chr2  25          G       15
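
The question mentions a depth file of about 50 GB, which may not fit in memory. One option is to read the file in chunks and accumulate the per-interval sums as you go. The following is a minimal base R sketch under stated assumptions: the file is whitespace-delimited with the header row shown in the question, a temporary file stands in for the real one, and `depth_file`, `chunk_size` and `totals` are hypothetical names chosen for illustration.

```r
# Stand-in for the large depth file: write the example table to a temp file
depth_file <- tempfile(fileext = ".txt")
writeLines(c(
  "chr Pos Nucleotide Coverage",
  "chr1 1 A 10", "chr1 2 G 12", "chr1 3 T 3", "chr1 4 A 20",
  "chr1 5 T 22", "chr1 6 N 0", "chr1 7 N 0",
  "chr2 23 A 1", "chr2 24 T 5", "chr2 25 G 15"
), depth_file)

intervals <- data.frame(
  chr   = c("chr1", "chr2", "chr4"),
  start = c(3L, 23L, 1L),
  end   = c(5L, 25L, 30L),
  stringsAsFactors = FALSE
)

chunk_size <- 4L                        # tiny for the demo; assumed much larger in practice
totals <- numeric(nrow(intervals))      # running Coverage sum per interval

con <- file(depth_file, open = "r")
# Read the header line once; later reads continue from the current position
header <- strsplit(readLines(con, n = 1L), "[[:space:]]+")[[1]]

repeat {
  chunk <- tryCatch(
    read.table(con, nrows = chunk_size, col.names = header,
               colClasses = c("character", "integer", "character", "numeric")),
    # read.table raises an error once no lines remain; treat that as end of file
    # (note: this also hides genuine parse errors, acceptable only in a sketch)
    error = function(e) NULL
  )
  if (is.null(chunk) || nrow(chunk) == 0L) break

  # Add this chunk's in-range Coverage to each interval's running total
  for (i in seq_len(nrow(intervals))) {
    hit <- chunk$chr == intervals$chr[i] &
      chunk$Pos >= intervals$start[i] &
      chunk$Pos <= intervals$end[i]
    totals[i] <- totals[i] + sum(chunk$Coverage[hit])
  }
}
close(con)

intervals$CovSum <- totals
intervals
```

Only one chunk is held in memory at a time, so peak memory depends on `chunk_size` rather than on the file size. The inner loop over intervals is fine for a handful of ranges; with many intervals, a join per chunk would likely be a better fit.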