Calculating allele frequency from a wig file in R

Date: 2015-02-13 17:51:12

Tags: r

  • I have a matrix (similar to a wig file) that looks like this:

    Position reference A C G T N sum(total read counts)
    68773265 A 1 0 0 0 0   1 
    68773266 C 0 1 0 1 0   2
    68773267 C 0 1 1 2 0   4
    
  • To get the variant (non-reference) allele ratio, I want to compute, for each position:

    (sum - count of the reference base) / sum * 100

The result should look like this:

   Position reference frequency(%) sum(total read counts) 
   68773265 A  0 1
   68773266 C 50 2
   68773267 C 75 4

Could you give me some advice on this? Thanks in advance!

1 answer:

Answer 0 (score: 0)

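Subset the count columns by name ("nm1"), match the "reference" column against "nm1" to get the column index of each row's reference base, and cbind it with 1:nrow(df1) to build a row/column index matrix. Take the rowSums of the "nm1" columns ("Sum1") and use it to compute "frequencyPercent" with the formula from the post. Note that "Sum1" leaves out N, so it equals the "sum(total read counts)" column only when N is 0, as it is in this example.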

nm1 <- c('A', 'C', 'G', 'T') # this could include `N` also
# one (row, column) pair per position; the column is that row's reference base
indx <- cbind(1:nrow(df1), match(df1$reference, nm1))
Sum1 <- rowSums(df1[nm1]) # read depth over the columns in nm1
data.frame(df1[1:2], frequencyPercent=100*(Sum1-df1[nm1][indx])/Sum1, 
          SumTotalCounts=df1[,ncol(df1)])
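
The step that does the work here is indexing with a two-column matrix: each row of the index is a (row, column) pair, so the result has one value per position. A minimal sketch of that idea on a toy matrix with the same counts as the question (the names `counts`, `ref`, `pick` and `ref_count` are made up for this illustration):

```r
# Toy count matrix, filled column by column; columns are the bases.
counts <- matrix(c(1, 0, 0,   # A
                   0, 1, 1,   # C
                   0, 0, 1,   # G
                   0, 1, 2),  # T
                 nrow = 3,
                 dimnames = list(NULL, c("A", "C", "G", "T")))
ref <- c("A", "C", "C")

# One (row, column) pair per position; the column is the reference base.
pick <- cbind(seq_len(nrow(counts)), match(ref, colnames(counts)))

# Matrix indexing returns one cell per pair: the reference-base count.
ref_count <- counts[pick]

# Percent of reads that do not carry the reference base.
nonref_pct <- 100 * (rowSums(counts) - ref_count) / rowSums(counts)
nonref_pct
```

For these three positions the reference counts are all 1, so the percentages come out as 0, 50 and 75, matching the table in the question.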

Or use transform on the original data set (reusing indx and Sum1 from above):

 transform(df1, frequencyPercent=100*(Sum1-df1[nm1][indx])/Sum1,
        check.names=FALSE)[c(1:2,8:9)]
 #   Position reference sum(total read counts) frequencyPercent
 #1 68773265         A                      1         0
 #2 68773266         C                      2        50
 #3 68773267         C                      4        75
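
Two edge cases are not covered by the example data: a position with no reads gives 0/0 (NaN), and a reference base that is not among the count columns makes match() return NA. A possible way to handle both, assuming N should count toward the read depth as in the question's sum column, is sketched below. The helper name `nonref_percent`, its `bases` argument and the `toy` data frame are hypothetical and not part of the original answer.

```r
# Hypothetical helper: percent of reads that do not carry the reference base.
nonref_percent <- function(df, bases = c("A", "C", "G", "T", "N")) {
  counts <- as.matrix(df[bases])
  total  <- rowSums(counts)                 # read depth, including N
  col    <- match(df$reference, bases)      # NA if the reference is not in `bases`
  ref    <- counts[cbind(seq_len(nrow(counts)), col)]  # NA index rows give NA
  pct    <- 100 * (total - ref) / total
  pct[total == 0] <- NA_real_               # no coverage: frequency is undefined
  pct
}

# Toy data: the three positions from the question plus one with zero coverage.
toy <- data.frame(Position  = 1:4,
                  reference = c("A", "C", "C", "G"),
                  A = c(1L, 0L, 0L, 0L),
                  C = c(0L, 1L, 1L, 0L),
                  G = c(0L, 0L, 1L, 0L),
                  T = c(0L, 1L, 2L, 0L),
                  N = c(0L, 0L, 0L, 0L),
                  stringsAsFactors = FALSE)

res <- nonref_percent(toy)
res
```

The first three values should be 0, 50 and 75 as before, and the uncovered fourth position should come back as NA rather than NaN.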

Data

df1 <- structure(list(Position = 68773265:68773267, reference = c("A", 
"C", "C"), A = c(1L, 0L, 0L), C = c(0L, 1L, 1L), G = c(0L, 0L, 
1L), T = 0:2, N = c(0L, 0L, 0L), `sum(total read counts)` = c(1L, 
2L, 4L)), .Names = c("Position", "reference", "A", "C", "G", 
"T", "N", "sum(total read counts)"), class = "data.frame", 
row.names = c(NA, -3L))