Question

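I have a database of genes located on different chromosomes at different positions. I also have a list of markers, each with a specific position. What I want to do is find the genes "around" each marker position. For example, I want to extract the genes within +/- 50K of a given marker. In addition, I want the output to include the marker information for every gene found.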

This is what I have:

Genes:

gene    chrom   position
1_1 1   2164
1_2 1   11418
1_3 1   24840
1_4 1   63649
1_5 1   82098
1_6 1   110179
1_7 1   155165
1_8 1   186074
2_1 2   143076
2_2 2   148971
2_3 2   154134
2_4 2   165298
3_1 3   25612
3_2 3   65767
3_3 3   81952
3_4 3   111681
3_5 3   116253

Markers:

Marker  chrom   position
1   1   101054
2   1   155002
3   9   6073302
4   8   5297131
5   5   12294888
6   8   6269394
7   10  1313426
8   1   56156551

This is what I want (sample):

Marker  chrom   position    gene    chrom   position
1   1   101054  1_4 1   63649
1   1   101054  1_5 1   82098
1   1   101054  1_6 1   110179
2   1   155002  1_6 1   110179
2   1   155002  1_7 1   155165
2   1   155002  1_8 1   186074

This is my code so far:

marker<-read.table("markers.txt",sep="\t",header=T)
gene<-read.table("genes.txt",sep="\t",header=T)

marker$low.lim<-marker$position-50000
marker$up.lim<-marker$position+50000

new<-gene[gene$chrom==marker$chrom[1] & gene$position %in% (marker$low.lim[1]:marker$up.lim[1]),]

I can't figure out how to turn this into a loop. Thanks.

Answer 1

The R package GenomicRanges is helpful for working with genomic ranges.

g.txt <- "gene    chrom   position
1_1 1   2164
1_2 1   11418
1_3 1   24840
1_4 1   63649
1_5 1   82098
1_6 1   110179
1_7 1   155165
1_8 1   186074
2_1 2   143076
2_2 2   148971
2_3 2   154134
2_4 2   165298
3_1 3   25612
3_2 3   65767
3_3 3   81952
3_4 3   111681
3_5 3   116253"

m.txt <- "Marker  chrom   position
1   1   101054
2   1   155002
3   9   6073302
4   8   5297131
5   5   12294888
6   8   6269394
7   10  1313426
8   1   56156551"

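# Parse the inline tables into data frames (as.is = TRUE keeps strings as characters)
genes <- read.table(text = g.txt, header = TRUE, as.is = TRUE)
mark <- read.table(text = m.txt, header = TRUE, as.is = TRUE)

library(GenomicRanges)
# Each gene is a width-1 range; each marker is a window of +/- 50K around its position
genes.gr <- GRanges(genes$chrom, IRanges(genes$position, genes$position))
mark.gr <- GRanges(mark$chrom, IRanges(mark$position - 50000, mark$position + 50000))

# Find gene/marker pairs that overlap on the same chromosome
g.m.op <- findOverlaps(genes.gr, mark.gr)
# queryHits() indexes rows of genes, subjectHits() indexes rows of mark
cbind(mark[subjectHits(g.m.op), ], genes[queryHits(g.m.op), ])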
#     Marker chrom position gene chrom position
# 1        1     1   101054  1_4     1    63649
# 1.1      1     1   101054  1_5     1    82098
# 1.2      1     1   101054  1_6     1   110179
# 2        2     1   155002  1_6     1   110179
# 2.1      2     1   155002  1_7     1   155165
# 2.2      2     1   155002  1_8     1   186074
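
If installing a Bioconductor package is not an option, the same window search can probably be done in base R without an explicit loop. The following is only a sketch under stated assumptions: it uses a small subset of the data above, and the names `window`, `pairs`, and `near` are illustrative, not from the original post. It joins markers to genes on `chrom` with `merge()` and then keeps the pairs whose positions differ by at most 50000.

```r
# Small subset of the question's data, for illustration only
genes <- data.frame(
  gene = c("1_4", "1_5", "1_6", "1_7", "1_8", "2_1"),
  chrom = c(1, 1, 1, 1, 1, 2),
  position = c(63649, 82098, 110179, 155165, 186074, 143076)
)
mark <- data.frame(
  Marker = c(1, 2, 3),
  chrom = c(1, 1, 9),
  position = c(101054, 155002, 6073302)
)

window <- 50000  # half-width of the search window (assumed, as in the question)

# Pair every marker with every gene on the same chromosome (inner join on chrom)
pairs <- merge(mark, genes, by = "chrom", suffixes = c(".marker", ".gene"))

# Keep only the pairs where the gene lies within the window around the marker
near <- pairs[abs(pairs$position.gene - pairs$position.marker) <= window, ]

# Sort by marker, then by gene position, for readability
near <- near[order(near$Marker, near$position.gene), ]
near
```

Note that `merge()` builds every marker/gene combination per chromosome before filtering, so on large data sets this may use far more memory than the `findOverlaps()` approach. The column names also differ from the `cbind()` output above: `chrom` appears once, and the two position columns get the `.marker` and `.gene` suffixes.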
