How can I distribute gene intervals evenly around a circle in R?

Time: 2014-12-07 16:08:22

Tags: r geometry

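I have varying numbers of genes spread across the 22 human chromosomes, with positions on each chromosome starting at 0 base pairs. I am trying to find a way to distribute the gene intervals evenly around a circle, keeping the relative positions of the genes and the length of each gene, but rebuilding the coordinates so that the genes are evenly spaced on each chromosome, with a space left between chromosomes. Here is an example of the data (the full dataset includes all chromosomes):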

df = structure(list(Chr = structure(c(1L, 1L, 1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 9L, 10L, 11L, 12L, 12L, 13L, 17L, 20L, 22L, 22L), .Label = c("chr1", 
"chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", 
"chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", 
"chr17", "chr18", "chr19", "chr20", "chr21", "chr22"), class = "factor"), 
    start = c(19068972, 25996369, 235879265, 46650500, 57732485, 
    44224566, 127510071, 33694865, 2297266, 105108497, 35252252, 
    64633822, 125738394, 416309, 93636009, 50070191, 72389245, 
    36432660, 19608500, 31498612), stop = c(20068972L, 26996369L, 
    236879265L, 47650500L, 58732485L, 45224566L, 128510071L, 
    34694865L, 3297266L, 106108497L, 36267753L, 65633822L, 126754018L, 
    1416309L, 94636009L, 51070191L, 73389245L, 37432660L, 20608500L, 
    32498612L), Gene = c("KIAA0090", "ZNF593", "GPR137B", "MCFD2", 
    "ABHD6", "GUF1", "FBN2", "HMGA1", "GNA12", "LRP12", "GBA2", 
    "NRBF2", "ST3GAL4", "WNK1", "SOCS2", "DLEU2", "FADS6", "BPI", 
    "TRMT2A", "PISD")), .Names = c("Chr", "start", "stop", "Gene"
), class = "data.frame", row.names = c(1L, 2L, 3L, 4L, 
5L, 6L,7L, 8L, 9L, 10L, 11L, 12L, 13L, 
14L, 15L, 16L, 17L, 18L, 19L, 20L))

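What I want to achieve is to reassign the gene intervals, starting from 0 on each chromosome, so that the space between genes is equal (with some space before the next gene as well):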

out = structure(list(Chr = structure(c(1L, 1L, 1L, 2L, 3L, 4L, 5L, 
6L, 7L, 8L, 9L, 10L, 11L, 12L, 12L, 13L, 17L, 20L, 22L, 22L), .Label = c("chr1", 
"chr2", "chr3", "chr4", "chr5", "chr6", "chr7", "chr8", "chr9", 
"chr10", "chr11", "chr12", "chr13", "chr14", "chr15", "chr16", 
"chr17", "chr18", "chr19", "chr20", "chr21", "chr22"), class = "factor"), 
    start = c(2000000, 4000000, 6000000, 2000000, 2000000, 
    2000000, 2000000, 2000000, 2000000, 2000000, 2000000, 
    2000000,2000000, 2000000, 4000000, 2000000, 2000000, 
    2000000, 2000000, 2000000), stop = c(3000000L, 5000000L, 
    7000000L, 3000000L, 3000000L, 3000000L, 3000000L, 
    3000000L, 3000000L, 3000000L, 3000000L, 3000000L, 3000000L, 
    3000000L, 5000000L, 3000000L, 3000000L, 3000000L, 3000000L, 
    3000000L), Gene = c("KIAA0090", "ZNF593", "GPR137B", "MCFD2", 
    "ABHD6", "GUF1", "FBN2", "HMGA1", "GNA12", "LRP12", "GBA2", 
    "NRBF2", "ST3GAL4", "WNK1", "SOCS2", "DLEU2", "FADS6", "BPI", 
    "TRMT2A", "PISD")), .Names = c("Chr", "start", "stop", "Gene"
), class = "data.frame", row.names = c(1L, 2L, 3L, 4L, 
5L, 6L,7L, 8L, 9L, 10L, 11L, 12L, 13L, 
14L, 15L, 16L, 17L, 18L, 19L, 20L))

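For this subset of the data, one could say there are 1000000 base pairs between genes, but the difficulty I am having is deciding how to choose this value. Should I divide the circumference by the number of genes across all my chromosomes and try to derive the right interval from that? Thanks for any suggestions!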

-fra

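1 answer:

Answer 0 (score: 1)

OK, here is a partial answer that may give you an idea and let you finish the whole thing. Note that my results show an overlap on chr1, but that may be my math. I'll let you track it down, since I'm not sure this is the solution you need and I can't test the plotting side.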

# Focus on chr1 and chr22 (rows 1:3 and 19:20), as they have multiple genes
df2 <- df[c(1:3, 19:20),]

Norm <- function(chrom) { # run on one chromosome at a time
    start <- chrom$start
    stop <- chrom$stop
    totLength <- max(stop) - min(start) 
    # simple normalization & offset
    newSt <- start/totLength
    newSt <- newSt - min(newSt)
    newEnd <- stop/totLength
    # NOTE: newSt was already shifted above, so min(newSt) is 0 here
    # and newEnd is left unshifted
    newEnd <- newEnd - min(newSt)
    totMax <- max(newSt, newEnd)
    newSt <- newSt/totMax
    newEnd <- newEnd/totMax
    return(data.frame(start = newSt, stop = newEnd))
}

normAll <- function(df) {
    #chromLvls <- levels(df$Chr) # this would work except your example data is truncated
    # and some levels are  missing
    chromLvls <- unique(as.character(df$Chr))
    noChrom <- length(chromLvls)
    drop <- 1:nrow(df)
    for (i in 1:noChrom) {
        df2 <- subset(df, df$Chr == chromLvls[i])
        df2[,c(2,3)] <- Norm(df2)
        df <- rbind(df, df2)
        }
    df <- df[-drop,] # drop the original rows; row no's are mangled, may not matter
    return(df)
 }

res <- normAll(df2)

addGapBtwGenes <- function(df, gap = 0.05) {
    # gap is a fraction on [0...1]
    # this acts on a subset composed of just one chromosome
    for (i in 1:nrow(df)) {
        df$start[i] <- df$start[i] + (i-1)*gap
        df$stop[i] <- df$stop[i] + i*gap
        # this denormalizes things but that probably doesn't matter
        }
    return(df)
    }

gapAllGenes <- function(df, gap = 0.05) {
    #chromLvls <- levels(df$Chr) # this would work except your example data is truncated
    # and some levels are  missing
    chromLvls <- unique(as.character(df$Chr))
    noChrom <- length(chromLvls)
    drop <- 1:nrow(df)
    for (i in 1:noChrom) {
        df2 <- subset(df, df$Chr == chromLvls[i])
        if (nrow(df2) == 1) { # no gap needed
           df <- rbind(df, df2)
           next
          }
        df2 <- addGapBtwGenes(df2, gap = gap)
        df <- rbind(df, df2)
        }
    df <- df[-drop,] # drop the original rows; row no's are mangled, may not matter
    return(df)
 }

res2 <- gapAllGenes(res)

You could write a function called addGapBtwChrom to control the gap between chromosomes, unless the plotting software already lets you do that.
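
As a rough illustration of what such a function might look like, here is a minimal sketch. The body of addGapBtwChrom is my own guess rather than tested code from this answer: it assumes positions on each chromosome start at 0, that `gap` is in the same units as start/stop, and the `toy` data frame is made up for the example.

```r
# Hypothetical helper: shift each chromosome so they sit end to end on one axis,
# separated by `gap` (same units as start/stop).
addGapBtwChrom <- function(df, gap = 0.1) {
    chromLvls <- unique(as.character(df$Chr))   # chromosomes in order of appearance
    offset <- 0
    for (chr in chromLvls) {
        rows <- as.character(df$Chr) == chr
        width <- max(df$stop[rows])             # extent of this chromosome before shifting
        df$start[rows] <- df$start[rows] + offset
        df$stop[rows] <- df$stop[rows] + offset
        offset <- offset + width + gap          # next chromosome begins after the gap
    }
    df
}

# Toy input (made up, not the output shown in this answer)
toy <- data.frame(Chr = c("chr1", "chr1", "chr22"),
                  start = c(0, 0.6, 0),
                  stop = c(0.4, 1.0, 1.0))
laid <- addGapBtwChrom(toy, gap = 0.1)
print(laid)
```

Dividing the shifted positions by the overall total (last stop plus one more gap) and multiplying by 2*pi would then give angles around the circle, if the plotting code needs them.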

The above gives res:

      Chr      start       stop     Gene
6    chr1 0.00000000 0.08472237 KIAA0090
7    chr1 0.02924442 0.11396679   ZNF593
8    chr1 0.91527763 1.00000000  GPR137B
191 chr22 0.00000000 0.63413477   TRMT2A
201 chr22 0.36586523 1.00000000     PISD

and res2:

       Chr      start      stop     Gene
61    chr1 0.00000000 0.1347224 KIAA0090
71    chr1 0.07924442 0.2139668   ZNF593
81    chr1 1.01527763 1.1500000  GPR137B
1911 chr22 0.00000000 0.6841348   TRMT2A
2011 chr22 0.41586523 1.1000000     PISD
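
The chr1 overlap in res appears to come from Norm() itself: newSt is shifted so that its minimum is 0 before newEnd is shifted, so subtracting min(newSt) from newEnd removes nothing and every stop stays too large by min(start)/totLength. A possible corrected variant, using a single offset and a single scale per chromosome, is sketched below. The name normFixed is made up for this example, and I have only checked it against the chr1 rows.

```r
# Hypothetical corrected variant of Norm(): one offset and one scale per chromosome.
normFixed <- function(chrom) {
    offset <- min(chrom$start)               # leftmost gene start on this chromosome
    totLength <- max(chrom$stop) - offset    # span covered by all genes
    data.frame(start = (chrom$start - offset) / totLength,
               stop = (chrom$stop - offset) / totLength)
}

# chr1 rows from the question's example data
chr1 <- data.frame(start = c(19068972, 25996369, 235879265),
                   stop = c(20068972, 26996369, 236879265))
fixed <- normFixed(chr1)
print(fixed)

# Each gene should now end before the next one starts
print(all(fixed$stop[-nrow(fixed)] <= fixed$start[-1]))
```

With this version the values stay on [0, 1] and each gene keeps the same fraction of the chromosome span, so the final rescaling by totMax is no longer needed.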

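Maybe this is close to what you want. Rereading some of your comments, I see that I have made all the chromosomes roughly the same length, but I'll wait a bit and see what you think of it.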
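
If the chromosomes should not all end up the same length, an alternative is to skip the normalization and lay the genes out directly in base pairs, keeping each gene's length and putting a fixed gap between neighbours. The sketch below is only an illustration: spaceGenes, `gap` and `lead` are names I made up, and the default values are simply the ones implied by the `out` example in the question.

```r
# Hypothetical helper: keep each gene's length in bp, but place the genes of one
# chromosome one after another, with `lead` bp before the first gene and `gap` bp
# between consecutive genes.
spaceGenes <- function(chrom, gap = 1e6, lead = 2e6) {
    chrom <- chrom[order(chrom$start), ]     # keep the original gene order
    len <- chrom$stop - chrom$start          # gene lengths are preserved
    n <- nrow(chrom)
    # new start = lead + lengths of all earlier genes + one gap per earlier gene
    chrom$start <- lead + c(0, cumsum(len)[-n]) + (seq_len(n) - 1) * gap
    chrom$stop <- chrom$start + len
    chrom
}

# chr1 and chr12 rows from the question's example data
genes <- data.frame(Chr = c("chr1", "chr1", "chr1", "chr12", "chr12"),
                    start = c(19068972, 25996369, 235879265, 416309, 93636009),
                    stop = c(20068972, 26996369, 236879265, 1416309, 94636009))

# Apply the helper to one chromosome at a time and stack the results
spaced <- do.call(rbind, lapply(split(genes, genes$Chr, drop = TRUE), spaceGenes))
print(spaced)
```

For these rows the new starts and stops match the chr1 and chr12 entries of `out` in the question. As for choosing the gap: if the plot maps the total length proportionally onto the full circle, only the ratio of the gap to the gene lengths affects the picture, so it can probably be treated as a visual tuning parameter rather than derived from the circumference.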