How do I transpose data from columns to rows, two columns at a time?

Time: 2018-12-22 22:49:30

Tags: r dataframe bioinformatics

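I am trying to transpose my data from columns to rows, but two columns at a time across all rows, while skipping the first two columns.

My starting data frame looks like the one below. Each row is a genetic marker. The first two columns give the position of that marker, and the remaining columns give the DNA nucleotides of the individuals at that marker.

Each individual, however, has two nucleotides per marker.

A zero denotes a missing value.

So this data frame has 5 genetic markers in the rows and 3 individuals in total. (Individual 1 has its two nucleotides in V1 and V2, individual 2 has them in V3 and V4, and so on.)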

group pos V1 V2 V3 V4 V5 V6 
1     10  A  A  G  G  T  T
2     11  C  C  G  G  A  A
3     12  T  T  T  A  C  G
4     13  0  0  0  A  C  G
5     14  G  T  0  0  C  A    

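I want to rearrange the data so that the individuals are in the rows and the genetic markers are in the columns. The nucleotide "pairs" must stay together, and the first two columns should be ignored.

This is the output I want: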

A A C C T T 0 0 G T 
G G G G T A 0 A 0 0 
T T A A C G C G C A 

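So far I have written a loop that works, but it is far too slow and cannot really handle more than 40,000 rows. My data frames can have 500,000 rows and 130 columns.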

oi2 <- list(NA) # initialise the result list "oi2"
for (j in seq(3, ncol(data), 2)) { # step through the genotype columns two at a time
  oi <- "" # placeholder first element, removed below
  for (i in 1:nrow(data)) { # do it for every row (marker)
    oi <- c(oi, as.character(data[i, j]), as.character(data[i, j + 1])) # append both nucleotides
  } # end of the row loop; we are still inside the column loop
  oi <- oi[-1] # remove the leading "" element
  oi2[[j - 2]] <- oi # save to list "oi2" at position j-2 (leaves NULL gaps in between)
} # end of the column loop
oi3 <- oi2[!sapply(oi2, is.null)] # remove the NULL elements from the list
# unlist the list, then convert to a matrix, and then to a data frame
df <- data.frame(matrix(unlist(oi3), nrow = length(oi3), byrow = TRUE,
                        ncol = length(oi3[[1]])))

Is there a more elegant approach that handles large data frames faster?

2 Answers:

Answer 0: (score: 0)

This may not be the most efficient approach, but it is the best I could come up with at this hour (1 a.m.).

Sample data

library( data.table )
dt <- fread("group pos V1 V2 V3 V4 V5 V6 
1     10  A  A  G  G  T  T
2     11  C  C  G  G  A  A
3     12  T  T  T  A  C  G
4     13  0  0  0  A  C  G
5     14  G  T  0  0  C  A", header = TRUE, stringsAsFactors = FALSE)

Code

library( tidyverse )
#paste together the rows of the dt (minus col 1 and 2)
l1 <- pmap( dt[, -c(1,2)], paste, sep = '')
#split the values in the list into pairs of 2 letters
l2 <- lapply( l1, strsplit, "(?<=.{2})", perl = TRUE )
#unlist
data <- unlist(l2)
#build a new matrix with three rows (one per individual; 3 is hard-coded for this sample)
matrix( data, nrow = 3) %>% apply( ., 1, paste, collapse = "")

Output

#[1] "AACCTT00GT" "GGGGTA0A00" "TTAACGCGCA"
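The code above hard-codes `nrow = 3` and needs two packages. As a rough sketch only, and not the answerer's method, the same strings can be built in base R with the number of individuals derived from the column count. The names `geno`, `n_ind` and `pairs` are made up for this illustration, and `colClasses = "character"` is an assumption added so that a column made up entirely of `T` is not read as logical.

```r
# Sample data from the question, read with every column as character
Lines <- "
group pos V1 V2 V3 V4 V5 V6
1     10  A  A  G  G  T  T
2     11  C  C  G  G  A  A
3     12  T  T  T  A  C  G
4     13  0  0  0  A  C  G
5     14  G  T  0  0  C  A"
DF <- read.table(text = Lines, header = TRUE, colClasses = "character")

geno  <- DF[-(1:2)]        # drop the two position columns
n_ind <- ncol(geno) / 2    # two columns per individual

# For each individual, paste its two columns together: one "AA"-style
# string per marker. The result is a markers x individuals matrix.
pairs <- sapply(seq_len(n_ind), function(p) {
  paste0(geno[[2 * p - 1]], geno[[2 * p]])
})

# Collapse each column (individual) into a single string
res <- apply(pairs, 2, paste, collapse = "")
res
#[1] "AACCTT00GT" "GGGGTA0A00" "TTAACGCGCA"
```

This still loops over individuals with `sapply`, but each iteration is a vectorised `paste0` over all markers, so the loop length is the number of individuals rather than the number of rows.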

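Answer 1: (score: 0)

1) Assuming the input DF shown reproducibly in the Note at the end, convert all columns except the first 2 into a 5x6 matrix, reshape it into a 5x2x3 array, permute its dimensions, and reshape the result into a 3x10 matrix. No packages are used.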

m <- as.matrix(DF[-(1:2)])
nr <- nrow(m) # 5
nc <- ncol(m) # 6

matrix(aperm(array(m, c(nr, 2, nc/2)), c(3, 2, 1)), nc/2)

giving:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "A"  "A"  "C"  "C"  "T"  "T"  "0"  "0"  "G"  "T"  
[2,] "G"  "G"  "G"  "G"  "T"  "A"  "0"  "A"  "0"  "0"  
[3,] "T"  "T"  "A"  "A"  "C"  "G"  "C"  "G"  "C"  "A" 
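For reuse on larger inputs, the reshaping in 1) could be wrapped in a small function. This is a minimal sketch: the function name `pairs_to_rows` and the even-column check are illustrative additions, not part of the original answer.

```r
# Hypothetical helper (name invented for this sketch): drops the first two
# columns and returns one row per individual with nucleotide pairs kept together.
pairs_to_rows <- function(df) {
  m  <- as.matrix(df[-(1:2)])
  nr <- nrow(m)
  nc <- ncol(m)
  stopifnot(nc %% 2 == 0)  # every individual must contribute exactly two columns
  # array index [marker, allele, individual] -> [individual, allele, marker]
  matrix(aperm(array(m, c(nr, 2, nc / 2)), c(3, 2, 1)), nrow = nc / 2)
}

Lines <- "
group pos V1 V2 V3 V4 V5 V6
1     10  A  A  G  G  T  T
2     11  C  C  G  G  A  A
3     12  T  T  T  A  C  G
4     13  0  0  0  A  C  G
5     14  G  T  0  0  C  A"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)

res <- pairs_to_rows(DF)
# One space-separated line per individual, as in the desired output
apply(res, 1, paste, collapse = " ")
#[1] "A A C C T T 0 0 G T" "G G G G T A 0 A 0 0" "T T A A C G C G C A"
```

The permutation `c(3, 2, 1)` makes the individual the slowest-varying row index and the allele the fastest-varying column index, which is what keeps each pair adjacent after the final `matrix()` call.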

2) A variation of the above is to transpose m first, reshape it into an array, then swap the first two dimensions and reshape into the final matrix.

matrix(aperm(array(t(m), c(2, nc/2, nr)), c(2, 1, 3)), nc/2)

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "A"  "A"  "C"  "C"  "T"  "T"  "0"  "0"  "G"  "T"  
[2,] "G"  "G"  "G"  "G"  "T"  "A"  "0"  "A"  "0"  "0"  
[3,] "T"  "T"  "A"  "A"  "C"  "G"  "C"  "G"  "C"  "A"  
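As a sanity check, the two variants can be compared with each other and with a direct, slower definition on synthetic data. The sizes below (1000 markers, 10 individuals) and the names `v1`, `v2`, `ref` are arbitrary choices for this sketch, and no timing claim is intended.

```r
set.seed(1)
nr    <- 1000   # markers (rows)
n_ind <- 10     # individuals, two columns each

# Random nucleotide matrix standing in for as.matrix(DF[-(1:2)])
m  <- matrix(sample(c("A", "C", "G", "T", "0"), nr * 2 * n_ind, replace = TRUE),
             nrow = nr)
nc <- ncol(m)

# Variant 1: reshape, then reverse the dimension order
v1 <- matrix(aperm(array(m, c(nr, 2, nc / 2)), c(3, 2, 1)), nc / 2)

# Variant 2: transpose first, then swap the first two dimensions
v2 <- matrix(aperm(array(t(m), c(2, nc / 2, nr)), c(2, 1, 3)), nc / 2)

# Reference: for individual p, interleave its two columns marker by marker
ref <- t(sapply(seq_len(nc / 2), function(p) {
  as.vector(t(m[, c(2 * p - 1, 2 * p)]))
}))

identical(v1, v2)   # TRUE
all(v1 == ref)      # TRUE
dim(v1)             # 10 2000
```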

Note

Lines <- "
group pos V1 V2 V3 V4 V5 V6 
1     10  A  A  G  G  T  T
2     11  C  C  G  G  A  A
3     12  T  T  T  A  C  G
4     13  0  0  0  A  C  G
5     14  G  T  0  0  C  A"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)