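I am trying to find a way to collapse rows with intersecting ranges, given by "start" and "stop" columns, and to record the collapsed values in new columns. For example, I have this data frame: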
my.df<- data.frame(chrom=c(1,1,1,1,14,16,16), name=c("a","b","c","d","e","f","g"), start=as.numeric(c(0,70001,70203,70060, 40004, 50000872, 50000872)), stop=as.numeric(c(71200,71200,80001,71051, 42004, 50000890, 51000952)))
chrom name start stop
1 a 0 71200
1 b 70001 71200
1 c 70203 80001
1 d 70060 71051
14 e 40004 42004
16 f 50000872 50000890
16 g 50000872 51000952
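I want to find the overlapping ranges, record in "start" and "stop" the largest range covered by each set of collapsed rows, and keep the names of the collapsed rows, so that I would get this: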
chrom start stop name
1 0 80001 a,b,c,d
14 40004 42004 e
16 50000872 51000952 f,g
I think I could use the IRanges package like this:
library(IRanges)
ranges <- split(IRanges(my.df$start, my.df$stop), my.df$chrom)
But I am having trouble getting the collapsed columns. I tried findOverlaps, like this:
ov <- findOverlaps(ranges, ranges, type="any")
but I don't think this is right.
Any help would be much appreciated.
Answer 0 (score: 12)
IRanges is a good candidate for this job. There is no need to use the chrom variable.
ir <- IRanges(my.df$start, my.df$stop)
## Create a new grouping variable. Note the use of reduce() here (for performance):
## each range is matched against the merged ranges rather than against every other range.
my.df$group2 <- subjectHits(findOverlaps(ir, reduce(ir)))
# chrom name start stop group2
# 1 1 a 70001 71200 2
# 2 1 b 70203 80001 2
# 3 1 c 70060 71051 2
# 4 14 d 40004 42004 1
# 5 16 e 50000872 50000890 3
# 6 16 f 50000872 51000952 3
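To see why matching each range against reduce(ir) gives a group label, here is a minimal sketch on three made-up ranges (the values and the names toy, merged and toy_group are illustrative only, not taken from the question):

```r
library(IRanges)

# Three toy ranges: the first two overlap, the third stands alone.
toy <- IRanges(start = c(1, 5, 20), end = c(10, 8, 30))

# reduce() merges overlapping (and adjacent) ranges into disjoint ones.
merged <- reduce(toy)

# Each original range overlaps exactly one merged range, so the index of
# that merged range can be used as a group label.
toy_group <- subjectHits(findOverlaps(toy, merged))
toy_group
```

Rows that share a label belong to the same merged range.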
The new group2 variable identifies the merged range each row belongs to. Now, using data.table, I can transform the data into the desired output:
library(data.table)
DT <- as.data.table(my.df)
DT[, list(start=min(start),stop=max(stop),
name=list(name),chrom=unique(chrom)),
by=group2]
# group2 start stop name chrom
# 1: 2 70001 80001 a,b,c 1
# 2: 1 40004 42004 d 14
# 3: 3 50000872 51000952 e,f 16
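PS: the collapsed name variable here is not a string but a list column (a list of factors, or of character vectors in R 4.0 and later, where data.frame() no longer converts strings to factors by default). This is more efficient and easier to access than a pasted, collapsed string.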
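If the comma-separated strings shown in the question are wanted after all, the list column can be collapsed afterwards. A minimal base R sketch, where name_list is a hypothetical stand-in shaped like the name column of the result:

```r
# Hypothetical list column, shaped like the `name` column in the result.
name_list <- list(c("a", "b", "c"), "d", c("e", "f"))

# Collapse each element into a single comma-separated string.
# as.character() makes this work for factors as well as character vectors.
name_string <- vapply(
  name_list,
  function(x) paste(as.character(x), collapse = ","),
  character(1)
)
name_string
```

Alternatively, writing name=paste(name, collapse=",") instead of name=list(name) in the data.table call should produce the strings directly.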
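Edit after the OP's clarification: I now create the grouping variable by chrom, that is, the IRanges code is called separately for each chrom group. I slightly modified your data to create several groups of intervals on the same chromosome.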
my.df<- data.frame(chrom=c(1,1,1,1,14,16,16),
name=c("a","b","c","d","e","f","g"),
start=as.numeric(c(0,3000,70203,70060, 40004, 50000872, 50000872)),
stop=as.numeric(c(1,5000,80001,71051, 42004, 50000890, 51000952)))
library(data.table)
DT <- as.data.table(my.df)
## find the merged interval of each row, separately for each chromosome
DT[,group := {
ir <- IRanges(start, stop);
subjectHits(findOverlaps(ir, reduce(ir)))
},by=chrom]
## Now group by both group and chrom (chrom is already a by column, so it is not repeated in j)
DT[, list(start=min(start),stop=max(stop),name=list(name)),
by=list(group,chrom)]
group chrom start stop name
1: 1 1 0 1 a
2: 2 1 3000 5000 b
3: 3 1 70060 80001 c,d
4: 1 14 40004 42004 e
5: 1 16 50000872 51000952 f,g
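Answer 1 (score: 5)
Once the data is sorted, you can easily test whether an interval overlaps the previous ones, and assign a label to each group of overlapping intervals.
Once you have those labels, you can use ddply to aggregate the data.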
d <- data.frame(
chrom = c(1,1,1,14,16,16),
name = c("a","b","c","d","e","f"),
start = as.numeric(c(0,70203,70060, 40004, 50000872, 50000872)),
stop = as.numeric(c(71200,80001,71051, 42004, 50000890, 51000952))
)
# Make sure the data is sorted
d <- d[ order(d$start), ]
# Check if a record should be linked with the previous
d$previous_stop <- c(NA, d$stop[-nrow(d)])
d$previous_stop <- cummax(ifelse(is.na(d$previous_stop),0,d$previous_stop))
d$new_group <- is.na(d$previous_stop) | d$start >= d$previous_stop
# The number of the current group of records is the number of times we have switched to a new group
d$group <- cumsum( d$new_group )
# We can now aggregate the data
library(plyr)
ddply(
d, "group", summarize,
start=min(start), stop=max(stop), name=paste(name,collapse=",")
)
# group start stop name
# 1 1 0 80001 a,d,c,b
# 2 2 50000872 51000952 e,f
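The cummax step matters when an interval is nested inside an earlier one: comparing only with the immediately preceding stop would then split a group too early. A minimal sketch with made-up values (starts, stops and the other names are illustrative, not from the question):

```r
# Toy intervals, already sorted by start; the second is nested in the first.
starts <- c(1, 2, 8, 20)
stops  <- c(10, 3, 12, 25)

# Stop of the immediately preceding interval (0 for the first row).
prev_stop <- c(0, stops[-length(stops)])

# Without cummax, the third interval is compared with stop = 3 and is
# wrongly split off from the first one, which it overlaps.
naive_group <- cumsum(starts >= prev_stop)

# With cummax, it is compared with the furthest stop seen so far (10).
group <- cumsum(starts >= cummax(prev_stop))

naive_group
group
```

Note that with >=, an interval that starts exactly where the previous group stops opens a new group, as in the code above.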
But this ignores the chrom column: to account for it, you can do the same thing separately for each chromosome.
d <- d[ order(d$chrom, d$start), ]
d <- ddply( d, "chrom", function(u) {
x <- c(NA, u$stop[-nrow(u)])
y <- ifelse( is.na(x), 0, x )
y <- cummax(y)
y[ is.na(x) ] <- NA
u$previous_stop <- y
u
} )
d$new_group <- is.na(d$previous_stop) | d$start >= d$previous_stop
d$group <- cumsum( d$new_group )
ddply(
d, .(chrom,group), summarize,
start=min(start), stop=max(stop), name=paste(name,collapse=",")
)
# chrom group start stop name
# 1 1 1 0 80001 a,c,b
# 2 14 2 40004 42004 d
# 3 16 3 50000872 51000952 e,f
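For readers who prefer to avoid plyr, the same idea can be sketched in base R. This is only one possible arrangement: collapse_ranges is a hypothetical helper name, the columns are assumed to be called chrom, name, start and stop with numeric start and stop, and the example data is assumed to have interval a running from 0 to 71200.

```r
# Hypothetical helper (not from the answer): merge overlapping intervals
# within each chromosome, using the same ">=" convention as above.
collapse_ranges <- function(df) {
  df <- df[order(df$chrom, df$start), ]
  # Furthest stop among earlier rows of the same chromosome
  # (-Inf for the first row of each chromosome).
  prev_max <- ave(df$stop, df$chrom,
                  FUN = function(s) cummax(c(-Inf, s[-length(s)])))
  # A row opens a new group when it starts at or after that stop.
  df$group <- cumsum(df$start >= prev_max)
  pieces <- lapply(split(df, df$group), function(g) {
    data.frame(chrom = g$chrom[1], start = min(g$start), stop = max(g$stop),
               name = paste(g$name, collapse = ","))
  })
  do.call(rbind, pieces)
}

# Example data, assumed to match the six rows used in this answer.
ex <- data.frame(
  chrom = c(1, 1, 1, 14, 16, 16),
  name  = c("a", "b", "c", "d", "e", "f"),
  start = c(0, 70203, 70060, 40004, 50000872, 50000872),
  stop  = c(71200, 80001, 71051, 42004, 50000890, 51000952)
)
res <- collapse_ranges(ex)
res
```

Because the rows are sorted by chrom first and the first row of every chromosome always opens a group, a single cumsum yields labels that never span two chromosomes.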