Question

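I have a data matrix named mymat. It has the columns 00860.GT and 00861.GT for samples 00860 and 00861. I want to extend this matrix with a new .AD column for each sample: .AD should be 50,0 if the sample's .GT value is 0/0, 25/25 if .GT is 0/1, and 0,50 if .GT is 1/1. I also want to add another column named .DP next to each of them, containing 50 in every row, to get the result shown below. How can I do this kind of conditional matrix expansion in R?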

mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L), .Dimnames = list(
c("chr1:1163804", "chr1:1888193"
), c("00860.GT", "00861.GT")))

Result:

             00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
chr1:1163804 0/1      25/25    50       0/0      50,0     50
chr1:1888193 1/1      0/50     50       0/0      50,0     50

Answer 1

There may be a better way, but one way to do this with dplyr is:

library(dplyr)

set.AD <- function(x) {                                                   ## 1.
  if (x=="0/0") {
    return("50/0")
  } else if (x=="0/1") {
    return("25/25")
  } else {
    return("0/50")
  }
}
mymat <- data.frame(ID=seq_len(nrow(mymat)),mymat)                        ## 2.
rnames <- rownames(mymat)
out <- mymat %>% group_by(ID) %>%                                         ## 3.
  mutate(`X00860.AD`=set.AD(`X00860.GT`), `X00860.DP`=50,
         `X00861.AD`=set.AD(`X00861.GT`), `X00861.DP`=50)
out <- data.frame(out[,-1])                                               ## 4.
rownames(out) <- rnames

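Notes:

1. Define a function that sets the AD value from the GT value according to your logic.
2. Convert the data to a data frame and add a unique identifier column, so that group_by can be used to apply the function to each row (set.AD uses if, which only tests a single value at a time). The row names are saved as well.
3. Use mutate to create the AD and DP columns from the X00860.GT and X00861.GT columns. Note that converting to a data frame prefixes the column names with X, because R does not like column names that start with a digit. See this SO answer for an explanation. At this point the result is a tibble, so:
4. Remove the ID column, convert back to a data frame, and restore the row names.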

The result for your data is:

print(out)
##             X00860.GT X00861.GT X00860.AD X00860.DP X00861.AD X00861.DP
##chr1:1163804       0/1       0/0     25/25        50      50/0        50
##chr1:1888193       1/1       0/0      0/50        50      50/0        50

To rearrange the columns to match your output, you can simply do:

out <- out[,c(1,3,4,2,5,6)]
##             X00860.GT X00860.AD X00860.DP X00861.GT X00861.AD X00861.DP
##chr1:1163804       0/1     25/25        50       0/0      50/0        50
##chr1:1888193       1/1      0/50        50       0/0      50/0        50

Obviously, this approach only handles two columns (samples), but it can handle any number of rows.
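The X prefix mentioned in note 3 comes from the default check.names = TRUE of data.frame(). As a small illustration (the variable names df.default and df.asis are made up for this example), passing check.names = FALSE keeps the original column names, although names that start with a digit then have to be quoted with backticks when they are used in expressions:

```r
mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L),
                   .Dimnames = list(c("chr1:1163804", "chr1:1888193"),
                                    c("00860.GT", "00861.GT")))

# default: names are passed through make.names(), which prepends "X"
df.default <- data.frame(mymat)
colnames(df.default)

# check.names = FALSE: the column names are kept exactly as they are
df.asis <- data.frame(mymat, check.names = FALSE)
colnames(df.asis)
```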

Edit: handling an arbitrary number of columns (samples)

Explanations are given as comments in the code.

# keep column and row names of original mymat to use later
cnames <- colnames(mymat)
rnames <- rownames(mymat)
# since DP columns are always 50, we just create a data frame filled with 50
# to bind to the result as additional columns
dp <- data.frame(matrix(rep(50,ncol(mymat)*nrow(mymat)), nrow=nrow(mymat), ncol=ncol(mymat)))
# set the column name to that of mymat
colnames(dp) <- cnames
# convert to data frame and augment with ID as before
mymat <- data.frame(ID=seq_len(nrow(mymat)),mymat)
# the difference here is that we use mutate_each to apply set.AD to each
# (and all) column of the input. This is done in-place. We then bind the 
# original mymat and dp as columns to this result
# (mutate_each() and funs() are deprecated in newer dplyr releases)
out <- mymat %>% group_by(ID) %>%
  mutate_each(funs(set.AD)) %>%
  ungroup() %>% select(-ID) %>%
  bind_cols(mymat[,-1],.) %>% bind_cols(dp)
# At this point, we have the original mymat columns followed by the 
# AD columns followed by the DP columns. The following uses a matrix 
# transpose trick to resort the columns to what you want
col.order <- as.vector(t(matrix(seq_len(ncol(out)), nrow=ncol(mymat)-1, ncol=3)))
out <- data.frame(out[,col.order])
# finally, use gsub to change the column names for the AD and DP columns,
# get rid of the 'X' in the column names, and add back the row names
colnames(out) <- gsub("X", "", gsub("GT.1", "AD", gsub("GT.2", "DP", colnames(out))))
rownames(out) <- rnames
print(out)
##             00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
##chr1:1163804      0/1    25/25       50      0/0     50/0       50
##chr1:1888193      1/1     0/50       50      0/0     50/0       50
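For comparison, the same expansion can be sketched in base R without dplyr. This is only an illustrative alternative, not the code above: the lookup vector ad.map and the intermediate names ad, dp and n are made up for the example, and the AD strings are the ones returned by set.AD. Because a matrix holds a single type, the DP values are stored as the string "50" here.

```r
mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L),
                   .Dimnames = list(c("chr1:1163804", "chr1:1888193"),
                                    c("00860.GT", "00861.GT")))

# lookup table: names are GT values, elements are the matching AD strings
ad.map <- c("0/0" = "50/0", "0/1" = "25/25", "1/1" = "0/50")

# translate every GT value at once, then restore the matrix shape
ad <- matrix(unname(ad.map[as.vector(mymat)]), nrow = nrow(mymat),
             dimnames = dimnames(mymat))
# DP is constant; stored as character because the result is a character matrix
dp <- matrix("50", nrow = nrow(mymat), ncol = ncol(mymat),
             dimnames = dimnames(mymat))
colnames(ad) <- sub("\\.GT$", ".AD", colnames(mymat))
colnames(dp) <- sub("\\.GT$", ".DP", colnames(mymat))

# bind the GT, AD and DP blocks, then interleave them per sample with the
# same transpose trick: GT1, AD1, DP1, GT2, AD2, DP2, ...
out <- cbind(mymat, ad, dp)
n <- ncol(mymat)
out <- out[, as.vector(t(matrix(seq_len(3 * n), nrow = n))), drop = FALSE]
print(out)
```

Since the result stays a matrix, the original column names are kept and no X prefix has to be removed afterwards.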

Hope this helps.

Answer 2

Here is a data.table solution with a comment on each line. It is written to handle an arbitrary number of columns in the mymat object. I will explain it briefly:

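1) First, we convert to data.table format. We can handle any number of columns, assuming they are formatted similarly.

2) We find all the ".GT" columns and extract the numbers that come before ".GT".

3) We create a ".DP" column for each ".GT" column found.

4) We build a "GT" to "AD" mapping by creating a vector that holds the "to" part of the mapping. The "from" part is stored as the names of the vector.

5) We use the .SDcols feature of data.table to apply the "GT" to "AD" mapping and create the "AD" columns.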
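The five steps above might be sketched as follows. This is a minimal illustration rather than necessarily the answerer's own code: the names dt, gt.cols, ids, dp.cols, ad.cols and gt2ad are assumptions made for the example, and the AD strings follow the wording of the question (50,0 / 25/25 / 0,50), so adjust them if a different format is needed.

```r
library(data.table)

mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L),
                   .Dimnames = list(c("chr1:1163804", "chr1:1888193"),
                                    c("00860.GT", "00861.GT")))

# 1) convert to data.table; the row names are kept in a column named "rn"
dt <- as.data.table(mymat, keep.rownames = TRUE)

# 2) find all ".GT" columns and extract the sample numbers before ".GT"
gt.cols <- grep("\\.GT$", names(dt), value = TRUE)
ids <- sub("\\.GT$", "", gt.cols)
dp.cols <- paste0(ids, ".DP")
ad.cols <- paste0(ids, ".AD")

# 3) create one ".DP" column per ".GT" column, filled with 50
dt[, (dp.cols) := 50]

# 4) GT -> AD mapping: the names are the "from" part, the values the "to" part
gt2ad <- c("0/0" = "50,0", "0/1" = "25/25", "1/1" = "0,50")

# 5) apply the mapping to every ".GT" column via .SDcols to create the ".AD" columns
dt[, (ad.cols) := lapply(.SD, function(gt) unname(gt2ad[gt])), .SDcols = gt.cols]

# put each sample's GT, AD and DP columns next to each other
setcolorder(dt, c("rn", as.vector(rbind(gt.cols, ad.cols, dp.cols))))
print(dt)
```

Both := calls modify dt by reference, so nothing has to be reassigned. A data.table has no row names, which is why the original row names live in the rn column.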

