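I aligned my RNA-Seq data with STAR and ended up with thousands of files. Each file is a log ("*Log.final.out") that summarizes the statistics for one lane (4 lanes per sample in total). Because I need to merge all of the statistics into a single file, I have to extract the following from every file, for every lane: the number of input reads, the number of uniquely mapped reads, and the percentage of uniquely mapped reads. Is there a way to extract everything I need from each file without copying and pasting it by hand?
Here is an example of what a log file looks like: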
Started job on | Jul 17 18:34:39
Started mapping on | Jul 17 18:34:39
Finished on | Jul 17 18:35:44
Mapping speed, Million of reads per hour | 507.64
Number of input reads | 9165655
Average input read length | 76
UNIQUE READS:
Uniquely mapped reads number | 7953458
Uniquely mapped reads % | 86.77%
Average mapped length | 73.74
Number of splices: Total | 1924655
Number of splices: Annotated (sjdb) | 1892117
Number of splices: GT/AG | 1909019
Number of splices: GC/AG | 6636
Number of splices: AT/AC | 1016
Number of splices: Non-canonical | 7984
Mismatch rate per base, % | 0.43%
Deletion rate per base | 0.01%
Deletion average length | 1.40
Insertion rate per base | 0.01%
Insertion average length | 1.30
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 1179823
% of reads mapped to multiple loci | 12.87%
Number of reads mapped to too many loci | 9207
% of reads mapped to too many loci | 0.10%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 0.22%
% of reads unmapped: other | 0.04%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Answer 0 (score: 2)
Try this:
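library(tidyverse)

# Placeholder: replace with the directory that holds the STAR logs
path <- "<PATH TO *.out FILES>"
# full.names = TRUE returns complete paths, so path and file name need not be pasted together;
# the pattern is a regular expression, so the dots are escaped
files <- list.files(path, pattern = "Log\\.final\\.out$", full.names = TRUE)

# Read one log file and keep only the three statistics of interest
merge_out <- function(file) {
  read.delim(file, header = FALSE) %>%
    # "Uniquely mapped reads" matches both the "number" line and the "%" line
    filter(grepl("Number of input reads", V1) |
           grepl("Uniquely mapped reads", V1)) %>%
    set_names("Var", "value")
}

# One small data frame per log file
results <- lapply(files, merge_out)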
Let me know if that helps.
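Since the goal is a single merged file, here is a rough sketch of one way to get there using base R only. It assumes every statistic sits on a "key | value" line, as in the example log above. The helper name read_star_log, the output column names, the demo file names, and star_summary.csv are all placeholders of mine, not anything STAR defines, and the demo logs are just the three relevant lines copied from the example.

```r
# Hypothetical helper: read one STAR Log.final.out and return a one-row data frame.
read_star_log <- function(file) {
  lines <- readLines(file)
  # Keep only "key | value" lines; section headers such as "UNIQUE READS:" have no "|"
  lines <- lines[grepl("|", lines, fixed = TRUE)]
  key   <- trimws(sub("\\|.*$", "", lines))     # text before the first "|"
  value <- trimws(sub("^[^|]*\\|", "", lines))  # text after the first "|"
  stats <- setNames(value, key)
  data.frame(
    file                = basename(file),
    input_reads         = as.numeric(stats[["Number of input reads"]]),
    uniquely_mapped     = as.numeric(stats[["Uniquely mapped reads number"]]),
    # Drop the trailing "%" so the column is numeric
    uniquely_mapped_pct = as.numeric(sub("%", "", stats[["Uniquely mapped reads %"]], fixed = TRUE)),
    stringsAsFactors    = FALSE
  )
}

# Demo: write two small logs (values taken from the example above) to a temp directory.
demo_dir <- tempfile("star_logs_")
dir.create(demo_dir)
example_log <- c(
  "Number of input reads |\t9165655",
  "UNIQUE READS:",
  "Uniquely mapped reads number |\t7953458",
  "Uniquely mapped reads % |\t86.77%"
)
writeLines(example_log, file.path(demo_dir, "sampleA_L001_Log.final.out"))
writeLines(example_log, file.path(demo_dir, "sampleA_L002_Log.final.out"))

# Collect every log, stack the one-row data frames, and write a single CSV.
files <- list.files(demo_dir, pattern = "Log\\.final\\.out$", full.names = TRUE)
summary_table <- do.call(rbind, lapply(files, read_star_log))
write.csv(summary_table, file.path(demo_dir, "star_summary.csv"), row.names = FALSE)
```

For real data, point list.files() at your own directory instead of demo_dir. The file column keeps the original file name, so the lane and sample can be recovered from it afterwards if your naming scheme encodes them.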