I have the following two data frames:
depth
chr Pos Nucleotide Coverage
chr1 1 A 10
chr1 2 G 12
chr1 3 T 3
chr1 4 A 20
chr1 5 T 22
chr1 6 N 0
chr1 7 N 0
chr2 23 A 1
chr2 24 T 5
chr2 25 G 15
and a second data frame of intervals:
intervals
chr1 3 5
chr2 23 25
chr4 1 30
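My expected output is as follows: if a position in the depth data frame falls within a range given in the intervals data frame and has the same chr value, sum the Coverage of all nucleotides in that range and put the total in a fourth column: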
chr1 3 5 45
chr2 23 25 21
chr4 1 30 0
How can I create these two data frames in R? For the depth data frame I have very large files, around 50 GB in size.
Answer 0 (score: 1)
You can use sqldf:
library(sqldf)
out1 <- sqldf('
  select i.*
       , coalesce(sum(d.Coverage), 0) as CovSum
  from intervals i
    left join depth d
      on d.Pos between i.low and i.high
      and d.chr = i.chr
  group by i.chr, i.low, i.high
')
out1
# chr low high CovSum
# 1 chr1 3 5 45
# 2 chr2 23 25 21
# 3 chr4 1 30 0
out2 <- sqldf('
  select d.*
  from intervals i
    join depth d
      on d.Pos between i.low and i.high
      and d.chr = i.chr
')
out2
# chr Pos Nucleotide Coverage
# 1 chr1 3 T 3
# 2 chr1 4 A 20
# 3 chr1 5 T 22
# 4 chr2 23 A 1
# 5 chr2 24 T 5
# 6 chr2 25 G 15
Data used:
library(data.table)
depth <- fread('
chr Pos Nucleotide Coverage
chr1 1 A 10
chr1 2 G 12
chr1 3 T 3
chr1 4 A 20
chr1 5 T 22
chr1 6 N 0
chr1 7 N 0
chr2 23 A 1
chr2 24 T 5
chr2 25 G 15
')
intervals <- fread('
chr low high
chr1 3 5
chr2 23 25
chr4 1 30
')
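Since data.table is already loaded to read the data, the interval sum could alternatively be sketched as a data.table non-equi join instead of SQL. This is a swapped-in technique, not part of the sqldf answer; the sample data is rebuilt inline so the block runs on its own, and the name res is just illustrative.

```r
library(data.table)

# Sample data from the question, rebuilt so this block is self-contained
depth <- data.table(
  chr        = c(rep("chr1", 7), rep("chr2", 3)),
  Pos        = c(1:7, 23:25),
  Nucleotide = c("A", "G", "T", "A", "T", "N", "N", "A", "T", "G"),
  Coverage   = c(10L, 12L, 3L, 20L, 22L, 0L, 0L, 1L, 5L, 15L)
)
intervals <- data.table(
  chr  = c("chr1", "chr2", "chr4"),
  low  = c(3L, 23L, 1L),
  high = c(5L, 25L, 30L)
)

# Non-equi join: for each row of intervals (by = .EACHI), sum the Coverage of
# the depth rows with the same chr and low <= Pos <= high.
# Intervals with no match yield NA Coverage, which na.rm = TRUE turns into 0.
res <- depth[intervals,
             on = .(chr, Pos >= low, Pos <= high),
             .(CovSum = sum(Coverage, na.rm = TRUE)),
             by = .EACHI]

# The two join columns both come back named Pos (holding low and high),
# so rename all columns by position.
setnames(res, c("chr", "low", "high", "CovSum"))
res
```

For the sample data CovSum comes out as 45, 21 and 0, matching out1. Whether this is fast enough for a 50 GB input depends on whether depth fits in memory at all, which is not addressed here.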
Answer 1 (score: 0)
dplyr is well suited to this kind of operation:
# first, read in the data, with headers
depth <- read.table(header = TRUE, text =
"chr Pos Nucleotide Coverage
chr1 1 A 10
chr1 2 G 12
chr1 3 T 3
chr1 4 A 20
chr1 5 T 22
chr1 6 N 0
chr1 7 N 0
chr2 23 A 1
chr2 24 T 5
chr2 25 G 15")
intervals <- read.table(header = TRUE, text =
"chr start end
chr1 3 5
chr2 23 25
chr4 1 30")
Now you can get to work:
library(dplyr)
# create a new data.frame:
# link intervals with any rows from depth where the value of 'chr' matches
# (keeping all rows from intervals)
merged <-
  merge(intervals, depth, by = 'chr', all.x = TRUE) %>%
  mutate(
    # flag rows in the range spec'd by intervals
    # (FALSE rather than NA when the interval has no rows in depth)
    in_range = coalesce(Pos >= start & Pos <= end, FALSE),
    # substitute 0 for any missing values in Coverage
    Coverage = coalesce(Coverage, 0L))
# now you can get your results:
result1 <-
  merged %>%
  group_by(chr, start, end) %>%
  # sum only the in-range rows, so an interval with no hits is kept with 0
  summarise(sum_cov = sum(Coverage[in_range]), .groups = 'drop')
result2 <-
  merged %>%
  # keep those in range
  filter(in_range) %>%
  # only get the columns that were in depth
  select(all_of(names(depth)))
The results match what you expected:
> result1
chr start end sum_cov
1 chr1 3 5 45
2 chr2 23 25 21
3 chr4 1 30 0
> result2
chr Pos Nucleotide Coverage
1 chr1 3 T 3
2 chr1 4 A 20
3 chr1 5 T 22
4 chr2 23 A 1
5 chr2 24 T 5
6 chr2 25 G 15
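Note that merge(..., by = 'chr') pairs every interval with every depth row on the same chromosome, which is unlikely to fit in memory for the 50 GB file mentioned in the question. A minimal sketch of one way around that in base R is to read the depth file in chunks and accumulate the per-interval sums. It assumes a whitespace-delimited file with a single header line and the four columns shown above; sum_coverage_chunked, path and chunk_size are hypothetical names introduced for illustration.

```r
# Hypothetical helper (not from the answers above): sums Coverage per interval
# while reading the depth file a fixed number of lines at a time.
sum_coverage_chunked <- function(path, intervals, chunk_size = 1e6) {
  con <- file(path, open = "r")
  on.exit(close(con))
  readLines(con, n = 1)                 # skip the header line
  totals <- numeric(nrow(intervals))    # doubles, to avoid integer overflow
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break       # end of file
    chunk <- read.table(
      text = lines,
      col.names  = c("chr", "Pos", "Nucleotide", "Coverage"),
      colClasses = c("character", "integer", "character", "integer"))
    # add this chunk's contribution to each interval's running total
    for (i in seq_len(nrow(intervals))) {
      hit <- chunk$chr == intervals$chr[i] &
        chunk$Pos >= intervals$start[i] &
        chunk$Pos <= intervals$end[i]
      totals[i] <- totals[i] + sum(chunk$Coverage[hit])
    }
  }
  cbind(intervals, sum_cov = totals)
}

# Usage on the sample data, written to a temporary file and read 4 lines at a time
path <- tempfile(fileext = ".txt")
writeLines(c("chr Pos Nucleotide Coverage",
             "chr1 1 A 10", "chr1 2 G 12", "chr1 3 T 3", "chr1 4 A 20",
             "chr1 5 T 22", "chr1 6 N 0", "chr1 7 N 0",
             "chr2 23 A 1", "chr2 24 T 5", "chr2 25 G 15"), path)
intervals <- data.frame(chr = c("chr1", "chr2", "chr4"),
                        start = c(3L, 23L, 1L),
                        end = c(5L, 25L, 30L),
                        stringsAsFactors = FALSE)
res <- sum_coverage_chunked(path, intervals, chunk_size = 4)
res
```

The loop over intervals is fine for a handful of ranges; with many intervals a faster lookup per chunk would be needed, which this sketch does not attempt.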