I have a dataset containing interval information for a bunch of manufacturing circuits:
df <- data.frame(structure(list(circuit = structure(c(2L, 1L, 2L, 1L, 2L, 3L,
1L, 1L, 2L), .Label = c("a", "b", "c"), class = "factor"), start = structure(c(1393621200,
1393627920, 1393628400, 1393631520, 1393650300, 1393646400, 1393656000,
1393668000, 1393666200), class = c("POSIXct", "POSIXt"), tzone = ""),
end = structure(c(1393626600, 1393631519, 1393639200, 1393632000,
1393660500, 1393673400, 1393667999, 1393671600, 1393677000
), class = c("POSIXct", "POSIXt"), tzone = ""), id = structure(1:9, .Label = c("1001",
"1002", "1003", "1004", "1005", "1006", "1007", "1008", "1009"
), class = "factor")), .Names = c("circuit", "start", "end",
"id"), class = "data.frame", row.names = c(NA, -9L)))
I was able to create a new dataset that counts the number of overlapping intervals:
library(IRanges)
ir <- IRanges(start = as.numeric(df$start), end = as.numeric(df$end), names = df$id)
# coverage() returns an Rle: runs of constant overlap depth
cov <- coverage(ir)
start_time <- as.POSIXlt(start(cov), origin = "1970-01-01")
end_time <- as.POSIXlt(end(cov), origin = "1970-01-01")
seconds <- runLength(cov)
circuits_running <- runValue(cov)
# drop the first run (from 1970 up to the first start)
res <- data.frame(start_time, end_time, seconds, circuits_running)[-1, ]
But what I really need is something that looks more like this:
library(sqldf)
sqldf("select
res.start_time,
res.end_time,
res.seconds,
res.circuits_running,
df.circuit,
df.id
from res left join df on (res.start_time between df.start and df.end)")
The problem is that the sqldf approach, which uses an inequality join, is unbearably slow on my full dataset.
How can I get something similar using only IRanges?
I suspect it has something to do with RangedData, but I haven't been able to see how to get what I want. Here is what I've tried:
rd <- RangedData(ir, circuit = df$circuit, id = df$id)
coverage(rd) # works but seems to lose the circuit/id info
Answer 0 (score: 1)
The coverage can be represented as ranges, dropping the first range (the one from 1970 to the first start point):
cov <- coverage(ir)
intervals <- ranges(cov)[-1]
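For intuition about what the coverage runs represent, here is a minimal base-R sketch (no IRanges involved) on a few hypothetical integer intervals. The `starts`/`ends` values and all variable names are made up for illustration, and this is not a description of how IRanges computes coverage internally.

```r
# Hypothetical closed integer intervals [start, end]; not the question's data
starts <- c(1L, 3L, 8L)
ends   <- c(5L, 6L, 9L)

# +1 where an interval begins, -1 just after it ends
pos   <- c(starts, ends + 1L)
delta <- c(rep(1L, length(starts)), rep(-1L, length(ends)))

# Sum the deltas at each distinct position, then accumulate to get the depth
net    <- tapply(delta, pos, sum)
breaks <- as.integer(names(net))
depth  <- cumsum(as.vector(net))

# Each row is a run of constant depth, ending just before the next break
runs <- data.frame(
  start = head(breaks, -1),
  end   = tail(breaks, -1) - 1L,
  depth = head(depth, -1)
)
runs
```

Unlike an Rle, this sketch does not merge adjacent runs that happen to have the same depth.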
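Your query looks up the start of each interval among the circuits, so I reduce each interval to its start coordinate and find overlaps (the first argument is the 'query', the second the 'subject'):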
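# resize(..., width = 1) keeps the first position of each interval (fix = "start" is the default);
# narrow(intervals, width(intervals)) would instead keep the last position
olaps <- findOverlaps(resize(intervals, width = 1), ir)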
The number of circuits running in a particular interval is
tabulate(queryHits(olaps), queryLength(olaps))
and the actual circuits are
df[subjectHits(olaps), c("circuit", "id")]
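The second argument to tabulate() in the count above matters: it fixes the number of bins at the number of query intervals, so trailing intervals that overlap nothing still get a count of 0 instead of being dropped. A small base-R illustration, where the `hits` vector and `n_query` are hypothetical stand-ins for queryHits(olaps) and queryLength(olaps):

```r
# Hypothetical query indices, as queryHits() might return for 4 query intervals
hits    <- c(1L, 1L, 2L, 2L, 2L)
n_query <- 4L

short <- tabulate(hits)           # bins only up to max(hits)
full  <- tabulate(hits, n_query)  # one bin per query interval

short
full
```

Here `short` has two entries while `full` has four, which keeps the counts aligned with the intervals when they are later bound together column-wise.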
These pieces can be knit together, perhaps as
df1 <- cbind(uid=seq_along(intervals),
as.data.frame(intervals),
circuits_running=tabulate(queryHits(olaps), queryLength(olaps)))
df2 <- cbind(uid=queryHits(olaps),
df[subjectHits(olaps), c("circuit", "id")])
merge(df1, df2, by="uid", all=TRUE)
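The all=TRUE in the merge() call plays the role of the outer join: an interval with no matching circuit is kept, with NA in the circuit columns. A toy base-R sketch, using hypothetical data frames `intervals_df` and `hits_df` in place of df1 and df2:

```r
# Hypothetical stand-ins: one row per interval, and one row per overlap hit
intervals_df <- data.frame(uid = 1:3, circuits_running = c(1L, 2L, 0L))
hits_df      <- data.frame(uid = c(1L, 2L, 2L), circuit = c("b", "a", "b"))

# all = TRUE keeps uid 3 even though it has no row in hits_df
joined <- merge(intervals_df, hits_df, by = "uid", all = TRUE)
joined
```

Interval 2 appears twice (once per circuit), and interval 3 appears once with a missing circuit.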
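Ranges can have 'metadata' associated with them that is accessed and subset in a coordinated way, so the connection between the data.frame and the ranges doesn't have to be so loose and ad hoc. I might instead have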
ir <- IRanges(start = as.numeric(df$start), end = as.numeric(df$end))
mcols(ir) <- DataFrame(df)
## ...
mcols(ir[subjectHits(olaps)])
Once you are done in IRanges-land, you might convert back with as.data.frame().
Questions about IRanges are best asked on the Bioconductor mailing list; no subscription is required.