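This is a simpler version of a question I asked earlier, which received no answers.
I have a large input file (a representative sample is shown below as input):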
> input
           CT1           CT2           CT3
1 chr1:200-400  chr1:250-450  chr1:400-800
2 chr1:800-970  chr2:200-500  chr1:700-870
3 chr2:300-700 chr2:600-1000 chr2:700-1400
I want to process it according to the rule described below, to get an output like this:
> output
              CT1 CT2 CT3
chr1:200-400    1   1   0
chr1:800-970    1   0   1
chr2:300-700    1   1   0
chr1:250-450    1   1   1
chr2:200-500    1   1   0
chr2:600-1000   1   1   1
chr1:400-800    0   1   1
chr1:700-870    1   0   1
chr2:700-1400   0   1   1
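Rule:
Take every value in the data frame (the first in this example is chr1:200-400) and check whether it overlaps with any other value in the data frame. If it does, write 1 under the column in which the overlapping value occurs; if not, write 0.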
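For example, take the first entry of the input, input[1,1], which is chr1:200-400. Since it occurs in column 1 (CT1), we write 1 under CT1. Next we check whether this range overlaps with any range in any other column of input. It overlaps only with the first value of the second column (CT2), chr1:250-450, so we write 1 under CT2 as well. Since no value in CT3 overlaps it, we write 0 under CT3 in the output data frame.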
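The rule can be sketched in base R for this one worked example. The helper names parse_range and overlaps_any are hypothetical, not part of the question. The strict inequalities are an assumption inferred from the expected output, where chr1:200-400 and chr1:400-800 share only position 400 and still score 0.

```r
# Hypothetical helper: split "chr:start-end" strings into their three parts.
parse_range <- function(x) {
  parts <- do.call(rbind, strsplit(as.character(x), ":|-"))
  data.frame(chr  = parts[, 1],
             low  = as.integer(parts[, 2]),
             high = as.integer(parts[, 3]),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: TRUE when the single range `a` overlaps any range in `b`.
# Assumption: ranges that merely touch at an endpoint do not count as overlapping.
overlaps_any <- function(a, b) {
  ra <- parse_range(a)
  rb <- parse_range(b)
  any(rb$chr == ra$chr & rb$low < ra$high & rb$high > ra$low)
}

CT1 <- c("chr1:200-400", "chr1:800-970", "chr2:300-700")
CT2 <- c("chr1:250-450", "chr2:200-500", "chr2:600-1000")
CT3 <- c("chr1:400-800", "chr1:700-870", "chr2:700-1400")

# One output row: for the query range, flag each column that contains an overlap.
query <- "chr1:200-400"
row1 <- sapply(list(CT1 = CT1, CT2 = CT2, CT3 = CT3),
               function(col) as.integer(overlaps_any(query, col)))
row1
```

For chr1:200-400 this yields 1 for CT1 (the range matches itself), 1 for CT2 and 0 for CT3, in line with the first row of the desired output.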
Here is the dput of input and output:
> dput(input)
structure(list(CT1 = structure(1:3, .Label = c("chr1:200-400",
"chr1:800-970", "chr2:300-700"), class = "factor"), CT2 = structure(1:3, .Label = c("chr1:250-450",
"chr2:200-500", "chr2:600-1000"), class = "factor"), CT3 = structure(1:3, .Label = c("chr1:400-800",
"chr1:700-870", "chr2:700-1400"), class = "factor")), .Names = c("CT1",
"CT2", "CT3"), class = "data.frame", row.names = c(NA, -3L))
> dput(output)
structure(list(CT1 = c(1L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L), CT2 = c(1L,
0L, 1L, 1L, 1L, 1L, 0L, 1L, 1L), CT3 = c(0L, 0L, 0L, 0L, 0L,
1L, 1L, 1L, 1L)), .Names = c("CT1", "CT2", "CT3"), class = "data.frame", row.names = c("chr1:200-400",
"chr1:800-970", "chr2:300-700", "chr1:250-450", "chr2:200-500",
"chr2:600-1000", "chr1:400-800", "chr1:700-870", "chr2:700-1400"
))
Answer 0 (score: 3)
A possible solution using the data.table package:
# load the 'data.table'-package and convert 'input' to a data.table with 'setDT'
library(data.table)
setDT(input)
# reshape 'input' to long format and split the strings in 3 columns
DT <- melt(input, measure.vars = 1:3)[
  , c('chr', 'low', 'high') := tstrsplit(value, split = ':|-', type.convert = TRUE)
  , by = variable][]
# create aggregation function; needed in the last reshape step
# (returns 1 when at least one overlapping range was found, 0 otherwise)
f <- function(x) as.integer(length(x) > 0)
# cartesian self join & reshape result back to wide format with aggregation function
# the non-equi conditions keep pairs on the same 'chr' whose ranges strictly overlap
DT[DT, on = .(chr, low < high, high > low), allow.cartesian = TRUE
   ][, dcast(.SD, value ~ i.variable, fun.aggregate = f)]
which gives:
           value CT1 CT2 CT3
1:  chr1:200-400   1   1   0
2:  chr1:250-450   1   1   1
3:  chr1:400-800   0   1   1
4:  chr1:700-870   1   0   1
5:  chr1:800-970   1   0   1
6:  chr2:200-500   1   1   0
7:  chr2:300-700   1   1   0
8: chr2:600-1000   1   1   1
9: chr2:700-1400   0   1   1
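To cross-check the result on a small input without data.table, a brute-force comparison of all pairs can be sketched in base R. This is only a verification aid under the same strict-overlap assumption as the join conditions above. It builds a full n-by-n matrix, so it is not a substitute for the join on a large file. The names origin, hit and check are made up for this sketch.

```r
input <- data.frame(
  CT1 = c("chr1:200-400", "chr1:800-970", "chr2:300-700"),
  CT2 = c("chr1:250-450", "chr2:200-500", "chr2:600-1000"),
  CT3 = c("chr1:400-800", "chr1:700-870", "chr2:700-1400"),
  stringsAsFactors = FALSE
)

# Flatten column by column, remembering which column each range came from.
value  <- unlist(input, use.names = FALSE)
origin <- rep(names(input), each = nrow(input))

# Split "chr:start-end" into chromosome, start and end.
parts <- do.call(rbind, strsplit(value, ":|-"))
chr  <- parts[, 1]
low  <- as.integer(parts[, 2])
high <- as.integer(parts[, 3])

# hit[i, j] is TRUE when ranges i and j share a chromosome and strictly overlap.
hit <- outer(chr, chr, "==") & outer(low, high, "<") & outer(high, low, ">")

# For each range, flag every column that holds at least one overlapping range.
check <- sapply(names(input), function(ct) {
  as.integer(rowSums(hit[, origin == ct, drop = FALSE]) > 0)
})
rownames(check) <- value
check
```

The rows of check follow the column-by-column order of the input rather than the sorted order that dcast produces, so compare the two by range name.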