Question

我试图将数据从列转置为行，但一次只能在所有行中进行两列。但是跳过前两列。

我的起始数据框如下所示：每行都是遗传标记。前两列提供了该标记的位置信息，随后几列提供了该特定标记处的个体的DNA核苷酸信息。

但是，每个人每个标记都有两个核苷酸。

零表示缺失值。

因此，在此数据框中，该行上有5个遗传标记，共有3个个体。（个人1在V1和V2中都有两个核苷酸，个人2在V3和V4中都有它们，依此类推）。

group pos V1 V2 V3 V4 V5 V6 
1     10  A  A  G  G  T  T
2     11  C  C  G  G  A  A
3     12  T  T  T  A  C  G
4     13  0  0  0  A  C  G
5     14  G  T  0  0  C  A

我想对数据重新排序，以使个人在行中，而遗传标记在列中。但是，我想将核苷酸的“对”放在一起，并忽略前两列。

我要放这个文件：

A A C C T T 0 0 G T 
G G G G T A 0 A 0 0 
T T A A C G C G C A

到目前为止，我已经编写了一个有效的循环。但这太慢了，它不能真正处理超过4万行。我的数据帧可以是50万行和130列。

oi2 <- list(NA) # create an empty list assigned to "oi2"
for(j in seq(3, ncol(data), 2)) { # create a sequence of data subset to keep 2 columns together 
oi <- "" # create an empty vector 
  for(i in 1:nrow(data)) { # do it for every row 
    oi <- c(oi, as.character(data[i,j]), as.character(data[i,j+1])) # add data together in a row 
  } # loop ends for row loop, were still inside first loop 
 oi <- oi[-1] # remove first "" element 
  oi2[[j-2]] <- oi # once oi is created, save to list "oi2", assigned to j-2 position in list 
} # loop closes 
oi3 <- oi2[!sapply(oi2, is.null)] # remove null elements in data frame 
# unlist the list and then convert to matrix, and then to data frame 
df <- data.frame(matrix(unlist(oi2), nrow=length(oi3), byrow=T, 
                          ncol = length(oi3[[1]])))

是否有更优雅的方法可以更快地处理大型数据帧？

Answer 1

这可能不是最有效的方法，但是我可以在这个小时（凌晨1点）想出的最好的方法

样本数据

library( data.table )
dt <- fread("group pos V1 V2 V3 V4 V5 V6 
1     10  A  A  G  G  T  T
2     11  C  C  G  G  A  A
3     12  T  T  T  A  C  G
4     13  0  0  0  A  C  G
5     14  G  T  0  0  C  A", header = TRUE, stringsAsFactors = FALSE)

代码

library( tidyverse )
#paste together the rows of the dt (minus col 1 and 2)
l1 <- pmap( dt[, -c(1,2)], paste, sep = '')
#split the values in the list into pairs of 2 letters
l2 <- lapply( l1, strsplit, "(?<=.{2})", perl = TRUE )
#unlist
data <- unlist(l2)
#build a new matrix with three rows
matrix( data, nrow = 3) %>% apply( ., 1, paste, collapse = "")

输出

#[1] "AACCTT00GT" "GGGGTA0A00" "TTAACGCGCA"

Answer 2

1）假定末尾的注记中可重复显示的输入DF将除前2列以外的所有列转换为5x6矩阵，然后将其重塑为5x2x3数组，对其进行置换尺寸并重塑为3x10矩阵。不使用任何软件包。

m <- as.matrix(DF[-(1:2)]
nr <- nrow(m) # 5
nc <- ncol(m) # 6

matrix(aperm(array(m, c(nr, 2, nc/2)), c(3, 2, 1)), nc/2)

给予：

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "A"  "A"  "C"  "C"  "T"  "T"  "0"  "0"  "G"  "T"  
[2,] "G"  "G"  "G"  "G"  "T"  "A"  "0"  "A"  "0"  "0"  
[3,] "T"  "T"  "A"  "A"  "C"  "G"  "C"  "G"  "C"  "A"

2）上面的一种变化是首先转置m，将其整形为数组，然后将刚开始的两个维重新排列为最后的矩阵。 / p>

matrix(aperm(array(t(m), c(2, nc/2, nr)), c(2, 1, 3)), nc/2)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "A"  "A"  "C"  "C"  "T"  "T"  "0"  "0"  "G"  "T"  
[2,] "G"  "G"  "G"  "G"  "T"  "A"  "0"  "A"  "0"  "0"  
[3,] "T"  "T"  "A"  "A"  "C"  "G"  "C"  "G"  "C"  "A"

注意

Lines <- "
group pos V1 V2 V3 V4 V5 V6 
1     10  A  A  G  G  T  T
2     11  C  C  G  G  A  A
3     12  T  T  T  A  C  G
4     13  0  0  0  A  C  G
5     14  G  T  0  0  C  A"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)

如何将数据从列转换为行，一次转换为两列？

2 个答案:

注意