Question

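How can a FASTA file be read directly into a data frame in R using basic code? These files store biological sequence information (e.g. DNA or protein) and, for n individual biomolecules (id1 to idn), contain 2 * n lines, so they look like this: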

>id1 #(always starts with a `>`) 
seq1
>id2
seq2
...
>idn
seqn

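If one wants to do this in base R (rather than with specialized packages such as Biostrings and seqinr, which use new classes for various operations on biological sequences), how can one use, for example, read.table to get a simple data frame with id and seq columns?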

Answer 1

This is certainly possible in base R. Consider the following example and function:

# Demo data
library(CHNOSZ)
file <- system.file("extdata/fasta/EF-Tu.aln", package="CHNOSZ")

# Function
ReadFasta <- function(file) {
   # Read the file line by line
   fasta <- readLines(file)
   # Identify header lines (only lines that START with ">")
   ind <- grep("^>", fasta)
   # Identify the first and last sequence line of each record
   s <- data.frame(ind = ind, from = ind + 1, to = c((ind - 1)[-1], length(fasta)))
   # Concatenate the sequence lines of each record into one string
   seqs <- character(length(ind))
   for (i in seq_along(ind)) {
      seqs[i] <- paste(fasta[s$from[i]:s$to[i]], collapse = "")
   }
   # Create a data frame; strip only the leading ">" from each header
   DF <- data.frame(name = sub("^>", "", fasta[ind]), sequence = seqs,
                    stringsAsFactors = FALSE)
   # Return the data frame as the result of the function
   return(DF)
}

# Usage example
seqs<-ReadFasta(file)

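Note, however, that the function does not currently handle special characters, for example, which are fairly common in sequence files (in contexts such as 5' or #5 rRNA).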
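One possible way to be more tolerant of such characters is sketched below, still in base R. It relies on the fact that readLines() returns each line as a plain string, without the quote and comment handling that read.table() applies by default (quote = "\"'" and comment.char = "#"), and it treats a line as a header only when ">" is its first character. The function name read_fasta_df and the demo records are made up for illustration; this is a minimal sketch under those assumptions, not a full FASTA parser.

```r
# Hypothetical helper (name chosen for this sketch, not from the answer above).
read_fasta_df <- function(path) {
  lines <- readLines(path, warn = FALSE)
  lines <- lines[nzchar(trimws(lines))]        # drop blank lines
  is_header <- startsWith(lines, ">")          # ">" must be the first character
  if (length(lines) == 0L || !is_header[1]) {
    stop("expected the first non-empty line to start with '>'")
  }
  record <- cumsum(is_header)                  # record number of every line
  # Group sequence lines by record; keeping all levels means a header
  # without any sequence lines still yields an (empty) entry.
  groups <- split(lines[!is_header],
                  factor(record[!is_header], levels = seq_len(sum(is_header))))
  seqs <- vapply(groups, paste, character(1), collapse = "")
  data.frame(id = sub("^>", "", lines[is_header]),
             seq = unname(seqs),
             stringsAsFactors = FALSE)
}

# Made-up demo file: a header containing ' and #, and a sequence split over two lines
demo_path <- tempfile(fileext = ".fasta")
writeLines(c(">id1 5' UTR #5 rRNA", "ACGT", "TTGA", ">id2", "MKV"), demo_path)
df <- read_fasta_df(demo_path)
print(df)
```

Grouping with cumsum() avoids the explicit loop and works both for the strict two-lines-per-record layout from the question and for sequences wrapped over several lines.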
