My data frame looks like this:
df1 <- data.frame(Group = c("scaf1", "scaf1", "scaf1", "scaf2", "scaf2", "scaf2", "scaf3", "scaf3", "scaf4", "scaf4"),
Start = c(10, 40, 90, 50, 80, 95, 600, 800, 70, 100),
End = c(50, 70, 120, 70, 100, 150, 700, 850, 100, 145))
df1
# Group Start End
# scaf1 10 50
# scaf1 40 70
# scaf1 90 120
# scaf2 50 70
# scaf2 80 100
# scaf2 95 150
# scaf3 600 700
# scaf3 800 850
# scaf4 70 100
# scaf4 100 145
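I want to compare the range of each row within a group and keep only the rows whose ranges overlap.
For example, in group scaf1, the Start value of the second row, 40, falls within the range of the previous row (Start = 10; End = 50), so both rows are kept.
The Start of the third row, 90, is not within the range of the previous row (Start = 40, End = 70), so that row is dropped. I would like to get the following result:
Group Start End
scaf1 10 50
scaf1 40 70
scaf2 80 100
scaf2 95 150
scaf4 70 100
scaf4 100 145
My own attempt at this failed.
Thanks in advance.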
Answer 0 (score: 1)
Here you go:
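df1 <- data.frame(Group = c("scaf1", "scaf1", "scaf1", "scaf2", "scaf2", "scaf2", "scaf3", "scaf3", "scaf4", "scaf4"),
                  Start = c(10, 40, 90, 50, 80, 95, 600, 800, 70, 100),
                  End = c(50, 70, 120, 70, 100, 150, 700, 850, 100, 145))
# helper column: TRUE for rows that belong to an overlapping pair
df1$filter <- FALSE
# start at row 2; seq_len() also copes with a data frame of fewer than 2 rows
for (k in seq_len(nrow(df1))[-1]) {
  # same group, and this row starts at or before the end of the previous row
  if (df1$Group[k] == df1$Group[k - 1] && df1$Start[k] <= df1$End[k - 1]) {
    df1$filter[k - 1] <- TRUE
    df1$filter[k] <- TRUE
  }
}
# keep only the flagged rows, then drop the helper column
df2 <- df1[df1$filter, ]
df2$filter <- NULL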
This is not a vectorized solution, but it works as expected.
Result:
> df2
Group Start End
1 scaf1 10 50
2 scaf1 40 70
5 scaf2 80 100
6 scaf2 95 150
9 scaf4 70 100
10 scaf4 100 145
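For reference, the same adjacent-row comparison can probably be vectorized in base R. The following is only a sketch under the same assumption as the loop, namely that the rows are already ordered so that each range only needs to be checked against the row directly before it; the names `hits` and `keep` are made up for illustration.

```r
df1 <- data.frame(
  Group = c("scaf1", "scaf1", "scaf1", "scaf2", "scaf2", "scaf2", "scaf3", "scaf3", "scaf4", "scaf4"),
  Start = c(10, 40, 90, 50, 80, 95, 600, 800, 70, 100),
  End   = c(50, 70, 120, 70, 100, 150, 700, 850, 100, 145)
)

n <- nrow(df1)
# hits[i] is TRUE when row i + 1 is in the same group as row i
# and starts at or before the end of row i
hits <- df1$Group[-1] == df1$Group[-n] & df1$Start[-1] <= df1$End[-n]
# a row is kept if it overlaps the row before it or the row after it
keep <- c(FALSE, hits) | c(hits, FALSE)
df2 <- df1[keep, ]
df2
```

Shifting the columns by one position replaces the explicit `k` and `k - 1` indexing, so no helper column has to be added to `df1` and removed again.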
Answer 1 (score: 0)
Another way to get this output is to use ranges as defined by the GenomicRanges package.
library(GenomicRanges)
# create a GRanges object
df1_gr <- GRanges(df1$Group, IRanges(df1$Start, df1$End))
# find the overlaps
gr <- as.data.frame(findOverlaps(df1_gr))
# remove self-overlapping
gr <- gr[gr$queryHits != gr$subjectHits,]
# final dataset; unique() keeps a row once even if it overlaps several others
df1[unique(gr$queryHits), ]
Group Start End
1: scaf1 10 50
2: scaf1 40 70
3: scaf2 80 100
4: scaf2 95 150
5: scaf4 70 100
6: scaf4 100 145
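Note that findOverlaps reports an overlap between any two ranges on the same sequence, not only between neighbouring rows. If installing a Bioconductor package is not an option, a similar any-pair check can be sketched in base R. This is only an illustration: `overlaps_any` is a hypothetical helper, not part of any package, and it treats intervals as closed, so ranges that merely touch (such as 70-100 and 100-145) count as overlapping.

```r
df1 <- data.frame(
  Group = c("scaf1", "scaf1", "scaf1", "scaf2", "scaf2", "scaf2", "scaf3", "scaf3", "scaf4", "scaf4"),
  Start = c(10, 40, 90, 50, 80, 95, 600, 800, 70, 100),
  End   = c(50, 70, 120, 70, 100, 150, 700, 850, 100, 145)
)

# hypothetical helper: for the ranges of one group, flag every range
# that shares at least one point with some other range of that group
overlaps_any <- function(start, end) {
  # m[i, j] is TRUE when start[i] <= end[j] and start[j] <= end[i]
  m <- outer(start, end, "<=") & outer(end, start, ">=")
  diag(m) <- FALSE  # ignore the overlap of each range with itself
  rowSums(m) > 0
}

keep <- logical(nrow(df1))
for (g in unique(as.character(df1$Group))) {
  idx <- which(df1$Group == g)
  keep[idx] <- overlaps_any(df1$Start[idx], df1$End[idx])
}
df1[keep, ]
```

For this data set the any-pair check and the adjacent-row check select the same six rows. They can differ when a long range overlaps a row that is not directly next to it.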