I wrote some code and it works fine. However, for practical reasons (and because I'd also like to learn more), it would be ideal if there were a shorter way to do this. Here is an example of the text files I am reading:
Analysis Date: Tue Oct 16 09:39:06 EDT 2018
Input file(s): 012-915-8-rep1.fastq
Output file(s): 012-915-8-rep1.vdjca
Version: 2.1.12; built=Wed Aug 22 08:47:36 EDT 2018; rev=99f9cc0;
lib=repseqio.v1.5
Command line arguments: align -c IGH -r report012-915-8-rep1.txt 012-915-8-
rep1.fastq 012-915-8-rep1.vdjca
Analysis time: 45.45s
Total sequencing reads: 198274
Successfully aligned reads: 167824 (84.64%)
Alignment failed, no hits (not TCR/IG?): 12122 (6.11%)
Alignment failed because of absence of J hits: 18235 (9.2%)
Alignment failed because of low total score: 93 (0.05%)
Overlapped: 0 (0%)
Overlapped and aligned: 0 (0%)
Alignment-aided overlaps: 0 (?%)
Overlapped and not aligned: 0 (0%)
IGH chains: 167824 (100%)
======================================
Analysis Date: Tue Oct 16 09:39:52 EDT 2018
Input file(s): 012-915-8-rep1.vdjca
Output file(s): 012-915-8-rep1.clns
Version: 2.1.12; built=Wed Aug 22 08:47:36 EDT 2018; rev=99f9cc0; lib=repseqio.v1.5
Command line arguments: assemble -OaddReadsCountOnClustering=true -r
report012-915-8-rep1.txt 012-915-8-rep1.vdjca 012-915-8-rep1.clns
Analysis time: 7.50s
Final clonotype count: 1227
Average number of reads per clonotype: 124.77
Reads used in clonotypes, percent of total: 153096 (77.21%)
Reads used in clonotypes before clustering, percent of total: 153096 (77.21%)
Number of reads used as a core, percent of used: 113699 (74.27%)
Mapped low quality reads, percent of used: 39397 (25.73%)
Reads clustered in PCR error correction, percent of used: 14522 (9.49%)
Reads pre-clustered due to the similar VJC-lists, percent of used: 0 (0%)
Reads dropped due to the lack of a clone sequence: 8958 (4.52%)
Reads dropped due to low quality: 0 (0%)
Reads dropped due to failed mapping: 5770 (2.91%)
Reads dropped with low quality clones: 0 (0%)
Clonotypes eliminated by PCR error correction: 5550
Clonotypes dropped as low quality: 0
Clonotypes pre-clustered due to the similar VJC-lists: 0
======================================
I basically only need lines 7, 8, and 26, which are "Total sequencing reads", "Successfully aligned reads", and "Reads used in clonotypes, percent of total". Everything else can be discarded. The code I run on several text files is below:
# Put in your actual path where the text files are saved
mypath = "C:/Users/ME/Desktop/REPORTS/text files/"
setwd(mypath)
#############################################################
# Functional Code
# Establish the dataframe
data <- data.frame("Total seq Reads"=integer(), "Successful Reads"=integer(), "Clonotypes"=integer())

# this should be a loop, I think, same action repeats, I just don't know how to format

wow <- readLines("C:/Users/ME/Desktop/REPORTS/text files/report012-915-8-rep1.txt")
woah <- wow[-c(1:6,9:25,27:39)]
blah <- as.numeric(gsub("\\D", "", gsub("\\(.*\\)", "", woah)))
data[nrow(data)+1,] <- blah

wow <- readLines("C:/Users/ME/Desktop/REPORTS/text files/report012-915-8-rep2.txt")
woah <- wow[-c(1:6,9:25,27:39)]
blah <- as.numeric(gsub("\\D", "", gsub("\\(.*\\)", "", woah)))
data[nrow(data)+1,] <- blah

row.names(data) <- c("012-915-8-rep1","012-915-8-rep2")

# Write CSV in R
write.csv(data, file = "Report_Summary.csv")
Is there a more efficient way to do this? I used only 2 files as an example here, but in practice I work with 20-80 files, which means doing this process by hand. Any help would be greatly appreciated! Thanks!
Answer 0 (score: 0)
You can make this a function and loop over the files. One thing you should know about is growing vectors/data.frames, e.g. data[nrow(data)+1,] <- blah. It is generally inefficient, so either start with a vector (or data.frame) of the required size and write your output into it, or collect the pieces and bind/reshape at the end. With a small number of rows you probably won't notice, but you will with many more. If you are interested, read up on vectorization.
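To illustrate the point above (this is a standalone sketch; `grown`, `rows`, and `bound` are made-up names, not part of your code), compare growing a data.frame one row at a time with collecting the rows in a list and binding once:

```r
# Growing row by row: each assignment copies the whole data.frame,
# so the cost climbs as the number of files grows.
grown <- data.frame(a = integer(), b = integer())
for (i in 1:3) {
  grown[nrow(grown) + 1, ] <- c(i, i * 10)
}

# Preferred: build each row independently, then bind once at the end.
rows <- lapply(1:3, function(i) data.frame(a = i, b = i * 10))
bound <- do.call(rbind, rows)
```

Both approaches produce three rows with the same values; the second scales much better over 20-80 files.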
textfunction <- function(x) {
  wow <- readLines(x)
  woah <- wow[c(9:10,29)] # I think these are the lines you are referencing
  blah <- as.numeric(gsub("\\D", "", gsub("\\(.*\\)", "", woah)))
  blah # return the three extracted counts
}
Then set your directory, list the filenames, apply your function, and transpose/rename.
library(data.table)
dir = "C:/Users/ME/Documents/"
# full.names = TRUE so readLines() gets complete paths; pattern is a regex
filenames <- list.files(path = dir, pattern = "\\.txt$", full.names = TRUE)
textreads <- lapply(filenames, function(x) textfunction(x))
data <- as.data.frame(data.table::transpose(textreads), col.names = c("Total seq Reads", "Successful Reads", "Clonotypes"), row.names = basename(filenames))
data
Total.seq.Reads Successful.Reads Clonotypes
text1.txt 198274 167824 153096
text2.txt 198274 167824 153096
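One further idea: if the report layout ever shifts by a line, matching on the field labels is more robust than fixed line positions. This is a standalone sketch of that alternative (the label strings come from the report shown in the question; `extract_counts` is a hypothetical helper, not part of the answer above):

```r
# Pull the count after each labelled field by matching the label text,
# not the line number.
extract_counts <- function(lines) {
  labels <- c("Total sequencing reads",
              "Successfully aligned reads",
              "Reads used in clonotypes, percent of total")
  sapply(labels, function(lab) {
    line <- grep(lab, lines, fixed = TRUE, value = TRUE)[1]
    # drop everything up to the colon, drop the "(..%)" part, keep the digits
    as.numeric(gsub("\\D", "", gsub("\\(.*\\)", "", sub(".*:", "", line))))
  })
}

# A few lines lifted from the sample report:
report <- c("Total sequencing reads: 198274",
            "Successfully aligned reads: 167824 (84.64%)",
            "Reads used in clonotypes, percent of total: 153096 (77.21%)")
extract_counts(report)
```

You could drop this in as the body of textfunction and the rest of the lapply pipeline would stay the same.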