Finding overlapping ranges in a data frame and assigning values to them

Time: 2018-01-11 12:30:22

Tags: r dataframe range bioinformatics overlapping

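This is a simpler version of the original question I asked, which nobody answered.

I have a huge input file (a representative example is shown below as input):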

> input
           CT1           CT2           CT3
1 chr1:200-400  chr1:250-450  chr1:400-800
2 chr1:800-970  chr2:200-500  chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400

I want to process it by following a rule (described below) to get something like output:

> output
              CT1 CT2 CT3
chr1:200-400    1   1   0
chr1:800-970    1   0   1
chr2:300-700    1   1   0
chr1:250-450    1   1   1
chr2:200-500    1   1   0
chr2:600-1000   1   1   1
chr1:400-800    0   1   1
chr1:700-870    1   0   1
chr2:700-1400   0   1   1

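Rule: Take each entry of the data frame (the first one in this example is chr1:200-400) and check whether it overlaps any other value in the data frame. If it does, write 1 under the column in which the overlapping value occurs; if it does not, write 0.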

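For example, take the first entry of the input, input[1,1], which is chr1:200-400. Since it comes from column 1 (CT1), we write 1 under CT1. Next we check whether this range overlaps any range in any other column of input. It overlaps only the first value of the second column (CT2), chr1:250-450, so we write 1 under CT2 as well. Since it overlaps none of the values in CT3, we write 0 under CT3 in the output data frame.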
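
The rule above comes down to a pairwise overlap test between two range strings. Below is a minimal sketch of that test in base R. The helper names parse_range and ranges_overlap are hypothetical and not part of the question. The strict inequalities are an assumption inferred from the example, where chr1:200-400 is not counted as overlapping chr1:400-800 even though the two share the endpoint 400.

```r
# Hypothetical helper (not from the question): split "chr:start-end" into its parts.
parse_range <- function(x) {
  parts <- strsplit(as.character(x), "[:-]")[[1]]
  list(chr = parts[1], low = as.integer(parts[2]), high = as.integer(parts[3]))
}

# Hypothetical helper: TRUE when both ranges lie on the same chromosome and
# share more than a single endpoint (strict inequalities, assumed from the example).
ranges_overlap <- function(a, b) {
  ra <- parse_range(a)
  rb <- parse_range(b)
  ra$chr == rb$chr && ra$low < rb$high && ra$high > rb$low
}

ranges_overlap("chr1:200-400", "chr1:250-450")  # TRUE
ranges_overlap("chr1:200-400", "chr1:400-800")  # FALSE: the ranges only touch at 400
ranges_overlap("chr1:200-400", "chr2:200-500")  # FALSE: different chromosome
```

If ranges that merely touch should count as overlapping, the two comparisons would become <= and >=.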

Here is the dput of input and output:

> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400", 
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450", 
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800", 
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1", 
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L, 
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L, 
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400", 
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500", 
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))

1 Answer:

Answer 0 (score: 3)

A possible solution using the data.table package:

# load the 'data.table'-package and convert 'input' to a data.table with 'setDT'
library(data.table)
setDT(input)

# reshape 'input' to long format and split the strings in 3 columns
DT <- melt(input, measure.vars = 1:3)[, c('chr','low','high') := tstrsplit(value, split = ':|-', type.convert = TRUE)
                                      , by = variable][]

# create aggregation function; needed in the last reshape step
f <- function(x) as.integer(length(x) > 0)

# cartesian self join & reshape result back to wide format with aggregation function
DT[DT, on = .(chr, low < high, high > low), allow.cartesian = TRUE
   ][, dcast(.SD, value ~ i.variable, fun.aggregate = f)]

which gives:

           value CT1 CT2 CT3
1:  chr1:200-400   1   1   0
2:  chr1:250-450   1   1   1
3:  chr1:400-800   0   1   1
4:  chr1:700-870   1   0   1
5:  chr1:800-970   1   0   1
6:  chr2:200-500   1   1   0
7:  chr2:300-700   1   1   0
8: chr2:600-1000   1   1   1
9: chr2:700-1400   0   1   1
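
As a cross-check on small data, the same table can be rebuilt in base R with a full pairwise comparison. This is only a sketch, not the answer's method: it builds an n-by-n logical matrix, so it will not scale to a huge file the way the non-equi join does. The variable names rng, src, ov and res are illustrative, and the input is re-created here with character columns rather than factors to keep the block self-contained.

```r
# Example data from the question, as plain character columns
input <- data.frame(
  CT1 = c("chr1:200-400", "chr1:800-970", "chr2:300-700"),
  CT2 = c("chr1:250-450", "chr2:200-500", "chr2:600-1000"),
  CT3 = c("chr1:400-800", "chr1:700-870", "chr2:700-1400"),
  stringsAsFactors = FALSE
)

# Flatten column by column and remember which column each range came from
rng <- unlist(input, use.names = FALSE)
src <- rep(names(input), each = nrow(input))

# Split "chr:start-end" into chromosome, start and end
parts <- do.call(rbind, strsplit(rng, "[:-]"))
chr  <- parts[, 1]
low  <- as.integer(parts[, 2])
high <- as.integer(parts[, 3])

# ov[i, j] is TRUE when range i overlaps range j
# (same chromosome, strict inequalities as in the join above)
ov <- outer(chr, chr, "==") & outer(low, high, "<") & outer(high, low, ">")

# For each range, flag every column that holds at least one overlapping range;
# a range always overlaps itself, so its own column is always 1
res <- sapply(names(input), function(col) {
  as.integer(rowSums(ov[, src == col, drop = FALSE]) > 0)
})
rownames(res) <- rng
res
```

Here the rows come out in the order of the input columns (CT1 first, then CT2, then CT3) instead of being sorted by range as in the dcast result.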