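Getting the ranges of synonymous and nonsynonymous nucleotide positions within codons separately

Time: 2018-11-22 10:23:59

Tags: r bioconductor genomicranges

I have a GRanges object (the coordinates of all exons of a gene); coding_pos gives the codon position at which a particular exon starts (1 means the first nucleotide of the exon is also the first nt of a codon, and so on).

grTargetGene itself looks like this: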

> grTargetGene

GRanges object with 11 ranges and 7 metadata columns:
   seqnames                 ranges strand |     ensembl_ids   gene_biotype prev_exons_length coding_pos
      <Rle>              <IRanges>  <Rle> |     <character>    <character>         <numeric>  <numeric>
   [1]     chr2 [148602722, 148602776]      + | ENSG00000121989 protein_coding       0           1
   [2]     chr2 [148653870, 148654077]      + | ENSG00000121989 protein_coding       55          2
   [3]     chr2 [148657027, 148657136]      + | ENSG00000121989 protein_coding       263         3
   [4]     chr2 [148657313, 148657467]      + | ENSG00000121989 protein_coding       373         2
   [5]     chr2 [148672760, 148672903]      + | ENSG00000121989 protein_coding       528         1
   [6]     chr2 [148674852, 148674995]      + | ENSG00000121989 protein_coding       672         1
   [7]     chr2 [148676016, 148676161]      + | ENSG00000121989 protein_coding       816         1
   [8]     chr2 [148677799, 148677913]      + | ENSG00000121989 protein_coding       962         3
   [9]     chr2 [148680542, 148680680]      + | ENSG00000121989 protein_coding       1077        1
  [10]     chr2 [148683600, 148683730]      + | ENSG00000121989 protein_coding       1216        2
  [11]     chr2 [148684649, 148684843]      + | ENSG00000121989 protein_coding       1347        1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
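
The two helper columns in the table above are related by simple arithmetic, which can be checked in base R. This is only a sketch under two assumptions: the exons are listed in transcript order on the plus strand, and the first range starts in frame. The variable names are illustrative; the coordinates are the first six rows of the printed object.

```r
# Start and end coordinates of the first six exons, copied from the table above
exon_start <- c(148602722, 148653870, 148657027, 148657313, 148672760, 148674852)
exon_end   <- c(148602776, 148654077, 148657136, 148657467, 148672903, 148674995)

# Exon widths (both ends are inclusive)
exon_width <- exon_end - exon_start + 1

# Number of coding bases that precede each exon (0 for the first exon)
prev_exons_length <- c(0, cumsum(exon_width)[-length(exon_width)])

# Codon position (1, 2 or 3) of the first nucleotide of each exon
coding_pos <- prev_exons_length %% 3 + 1

prev_exons_length  # 0 55 263 373 528 672
coding_pos         # 1 2 3 2 1 1
```

Both vectors agree with the prev_exons_length and coding_pos columns shown for rows [1] to [6].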

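I am interested in looking separately at the coordinates of positions [1,2] and position [3] of every codon. In other words, I would like 2 different GRanges objects that look roughly like this (only the beginning is shown here):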

> grTargetGene_Nonsynonym

GRanges object with X ranges and 7 metadata columns:
   seqnames                 ranges strand |     ensembl_ids   gene_biotype 
      <Rle>              <IRanges>  <Rle> |     <character>    <character> 
   [1]     chr2 [148602722, 148602723]      + | ENSG00000121989 protein_coding
   [2]     chr2 [148602725, 148602726]      + | ENSG00000121989 protein_coding
   [3]     chr2 [148602728, 148602729]      + | ENSG00000121989 protein_coding
   [4]     chr2 [148602731, 148602732]      + | ENSG00000121989 protein_coding



> grTargetGene_Synonym

GRanges object with X ranges and 7 metadata columns:
   seqnames                 ranges strand |     ensembl_ids   gene_biotype 
      <Rle>              <IRanges>  <Rle> |     <character>    <character> 
   [1]     chr2 [148602724, 148602724]      + | ENSG00000121989 protein_coding
   [2]     chr2 [148602727, 148602727]      + | ENSG00000121989 protein_coding
   [3]     chr2 [148602730, 148602730]      + | ENSG00000121989 protein_coding
   [4]     chr2 [148602733, 148602733]      + | ENSG00000121989 protein_coding
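
The expected output above can be reproduced for the first exon with one line of arithmetic: on the plus strand, the base at 0-based offset k from the exon start sits at codon position ((coding_pos - 1 + k) %% 3) + 1. The following base-R sketch assumes exactly that, and treats codon position 3 as the "synonymous" set in the same way the question does; the variable names are illustrative.

```r
# First exon from the question
exon_start <- 148602722
exon_end   <- 148602776
coding_pos <- 1  # codon position of the exon's first nucleotide

# Every genomic position in the exon and its position within its codon
pos       <- seq(exon_start, exon_end)
codon_pos <- (coding_pos - 1 + (pos - exon_start)) %% 3 + 1

third        <- pos[codon_pos == 3]  # "synonymous" candidates
first_second <- pos[codon_pos != 3]  # "nonsynonymous" candidates

head(third, 4)         # 148602724 148602727 148602730 148602733
head(first_second, 8)  # 148602722 148602723 148602725 148602726 ...
```

The first four third-position sites match the grTargetGene_Synonym rows, and the first eight remaining sites pair up into the four grTargetGene_Nonsynonym ranges.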

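I was planning to do this with a loop that builds a set of GRanges for each exon according to coding_pos and strand, but I suspect there is a smarter way, or even an existing function that already does this; I just could not find a simple solution.

Important: I do not need the sequence itself (in that case the easiest way would be to extract the DNA first and then work with the sequence); I only need the positions, which I will then use to overlap with some features.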

> library("GenomicRanges")
> dput(grTargetGene)

new("GRanges"
, seqnames = new("Rle"
, values = structure(1L, .Label = "chr2", class = "factor")
, lengths = 6L
, elementMetadata = NULL
, metadata = list()
)
, ranges = new("IRanges"
, start = c(148602722L, 148653870L, 148657027L, 148657313L, 148672760L, 
148674852L)
, width = c(55L, 208L, 110L, 155L, 144L, 144L)
, NAMES = NULL
, elementType = "integer"
, elementMetadata = NULL
, metadata = list()
)
, strand = new("Rle"
, values = structure(1L, .Label = c("+", "-", "*"), class = "factor")
, lengths = 6L
, elementMetadata = NULL
, metadata = list()
)
, elementMetadata = new("DataFrame"
, rownames = NULL
, nrows = 6L
, listData = structure(list(ensembl_ids =
c("ENSG00000121989","ENSG00000121989", 
"ENSG00000121989", "ENSG00000121989", "ENSG00000121989", "ENSG00000121989"
), gene_biotype = c("protein_coding", "protein_coding", "protein_coding", 
"protein_coding", "protein_coding", "protein_coding"), cds_length =   
c(1542,1542, 1542, 1542, 1542, 1542), gene_start_position = c(148602086L, 
148602086L, 148602086L, 148602086L, 148602086L, 148602086L), 
gene_end_position = c(148688393L, 148688393L, 148688393L, 
148688393L, 148688393L, 148688393L), prev_exons_length = c(0, 
55, 263, 373, 528, 672), coding_pos = c(1, 2, 3, 2, 1, 1)), .Names =  
c("ensembl_ids", "gene_biotype", "cds_length", "gene_start_position",
"gene_end_position", 
"prev_exons_length", "coding_pos"))
, elementType = "ANY"
, elementMetadata = NULL
, metadata = list()
)
, seqinfo = new("Seqinfo"
, seqnames = "chr2"
, seqlengths = NA_integer_
, is_circular = NA
, genome = NA_character_
)
, metadata = list()
)

2 Answers:

Answer 0: (score: 2)

How about the following:

grl <- lapply(list(Nonsym = c(1, 2), Sym = c(3, 3)), function(x) {
    ranges(grTargetGene) <- IRanges(
        start = start(grTargetGene) + x[1] - 1,
        end = start(grTargetGene) + x[2] - 1)
    return(grTargetGene) })
grl
#$Nonsym
#GRanges object with 6 ranges and 7 metadata columns:
#      seqnames              ranges strand |     ensembl_ids   gene_biotype
#         <Rle>           <IRanges>  <Rle> |     <character>    <character>
#  [1]     chr2 148602722-148602723      + | ENSG00000121989 protein_coding
#  [2]     chr2 148653870-148653871      + | ENSG00000121989 protein_coding
#  [3]     chr2 148657027-148657028      + | ENSG00000121989 protein_coding
#  [4]     chr2 148657313-148657314      + | ENSG00000121989 protein_coding
#  [5]     chr2 148672760-148672761      + | ENSG00000121989 protein_coding
#  [6]     chr2 148674852-148674853      + | ENSG00000121989 protein_coding
#      cds_length gene_start_position gene_end_position prev_exons_length
#       <numeric>           <integer>         <integer>         <numeric>
#  [1]       1542           148602086         148688393                 0
#  [2]       1542           148602086         148688393                55
#  [3]       1542           148602086         148688393               263
#  [4]       1542           148602086         148688393               373
#  [5]       1542           148602086         148688393               528
#  [6]       1542           148602086         148688393               672
#      coding_pos
#       <numeric>
#  [1]          1
#  [2]          2
#  [3]          3
#  [4]          2
#  [5]          1
#  [6]          1
#  -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths
#
#$Sym
#GRanges object with 6 ranges and 7 metadata columns:
#      seqnames    ranges strand |     ensembl_ids   gene_biotype cds_length
#         <Rle> <IRanges>  <Rle> |     <character>    <character>  <numeric>
#  [1]     chr2 148602724      + | ENSG00000121989 protein_coding       1542
#  [2]     chr2 148653872      + | ENSG00000121989 protein_coding       1542
#  [3]     chr2 148657029      + | ENSG00000121989 protein_coding       1542
#  [4]     chr2 148657315      + | ENSG00000121989 protein_coding       1542
#  [5]     chr2 148672762      + | ENSG00000121989 protein_coding       1542
#  [6]     chr2 148674854      + | ENSG00000121989 protein_coding       1542
#      gene_start_position gene_end_position prev_exons_length coding_pos
#                <integer>         <integer>         <numeric>  <numeric>
#  [1]           148602086         148688393                 0          1
#  [2]           148602086         148688393                55          2
#  [3]           148602086         148688393               263          3
#  [4]           148602086         148688393               373          2
#  [5]           148602086         148688393               528          1
#  [6]           148602086         148688393               672          1
#  -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths

grl is a list containing two GRanges objects: one with ranges based on positions 1 and 2, and one with ranges based on position 3.
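
One caveat worth checking: as written, the snippet returns a single range per exon, counted from the exon start, and does not consult coding_pos. A small base-R sketch (illustrative variable names, first three exons from the question) shows which codon position the "Sym" coordinate lands on when the exon does not start in frame:

```r
# First three exons from the question
exon_start <- c(148602722, 148653870, 148657027)
coding_pos <- c(1, 2, 3)

# Coordinate stored in the "Sym" element: third base of each exon
sym_pos <- exon_start + 2

# Codon position of that base once coding_pos is taken into account
codon_pos_of_sym <- (coding_pos - 1 + (sym_pos - exon_start)) %% 3 + 1
codon_pos_of_sym  # 3 1 2
```

Only the first exon, which starts at codon position 1, yields a true third-position site, so the approach would need an offset derived from coding_pos and a repeat over every codon to cover the whole exon.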

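Answer 1: (score: -1)

I created a function that takes the strand into account and can handle exons whose length is not divisible by 3 (and that may even be shorter than 3).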
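
A minimal sketch of what such a function could look like; it is not the answerer's original code. The helper names codon_sites, sites_for_exons and to_granges and the exons data frame are hypothetical. It assumes that on the minus strand coding_pos refers to the exon's first nucleotide in transcript direction, which is the base at the range's end coordinate. The core logic is base R; the GRanges conversion runs only if GenomicRanges is installed.

```r
# Hypothetical helper: genomic positions of one exon that fall on the
# requested codon positions (keep = 3, or keep = c(1, 2)).
codon_sites <- function(start, end, strand, coding_pos, keep = 3) {
  pos <- seq(start, end)
  # Offset from the exon's 5' end in transcript direction:
  # counted from `start` on "+", from `end` on "-".
  offset <- if (strand == "-") end - pos else pos - start
  codon_pos <- (coding_pos - 1 + offset) %% 3 + 1
  pos[codon_pos %in% keep]
}

# Apply the helper to every exon; returns one row per selected base.
sites_for_exons <- function(exons, keep) {
  hits <- lapply(seq_len(nrow(exons)), function(i) {
    codon_sites(exons$start[i], exons$end[i],
                as.character(exons$strand[i]), exons$coding_pos[i], keep)
  })
  data.frame(exon = rep(seq_along(hits), lengths(hits)), pos = unlist(hits))
}

# First two exons from the question
exons <- data.frame(start = c(148602722, 148653870),
                    end = c(148602776, 148654077),
                    strand = c("+", "+"),
                    coding_pos = c(1, 2))

third        <- sites_for_exons(exons, keep = 3)
first_second <- sites_for_exons(exons, keep = c(1, 2))

# Optional conversion to GRanges, skipped when the package is unavailable
if (requireNamespace("GenomicRanges", quietly = TRUE)) {
  suppressPackageStartupMessages(library(GenomicRanges))
  to_granges <- function(sites, exons, seqname = "chr2") {
    gr <- GRanges(seqnames = seqname,
                  ranges = IRanges(start = sites$pos, width = 1),
                  strand = as.character(exons$strand[sites$exon]))
    reduce(gr)  # merges adjacent single-base ranges; drops metadata columns
  }
  grTargetGene_Synonym    <- to_granges(third, exons)
  grTargetGene_Nonsynonym <- to_granges(first_second, exons)
}
```

Because each base is classified individually, an exon shorter than 3 nt or one that ends mid-codon needs no special handling. For example, codon_sites(100, 104, "-", 1, keep = 3) selects position 102 on a minus-strand exon of length 5.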


It works well.
