Grep summary statistics from log files

Time: 2018-07-31 10:23:22

Tags: r loops unix readr

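I aligned my RNA-Seq data with STAR and ended up with thousands of files. Each file is a log ("*Log.final.out") that summarizes the statistics for one lane (each sample has 4 lanes in total). Because I need to merge all the statistics into a single file, I have to extract the following from each file, for each lane: the number of input reads, the number of uniquely mapped reads, and the percentage of uniquely mapped reads. Is there a way to extract everything I need from each file without copying and pasting it by hand?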

Here is an example of what a log file looks like:

                             Started job on |   Jul 17 18:34:39
                         Started mapping on |   Jul 17 18:34:39
                                Finished on |   Jul 17 18:35:44
   Mapping speed, Million of reads per hour |   507.64

                      Number of input reads |   9165655
                  Average input read length |   76
                                UNIQUE READS:
               Uniquely mapped reads number |   7953458
                    Uniquely mapped reads % |   86.77%
                      Average mapped length |   73.74
                   Number of splices: Total |   1924655
        Number of splices: Annotated (sjdb) |   1892117
                   Number of splices: GT/AG |   1909019
                   Number of splices: GC/AG |   6636
                   Number of splices: AT/AC |   1016
           Number of splices: Non-canonical |   7984
                  Mismatch rate per base, % |   0.43%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.40
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.30
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   1179823
         % of reads mapped to multiple loci |   12.87%
    Number of reads mapped to too many loci |   9207
         % of reads mapped to too many loci |   0.10%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
             % of reads unmapped: too short |   0.22%
                 % of reads unmapped: other |   0.04%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

1 answer:

Answer 0 (score: 2)

Try this:

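library(tidyverse)

path <- "<PATH TO *.out FILES>"
# pattern is a regular expression, so escape the dot and anchor the end;
# full.names = TRUE returns paths that can be passed straight to read.delim()
files <- list.files(path, pattern = "\\.out$", full.names = TRUE)

# Read one log file and keep only the rows of interest
merge_out <- function(file) {
  read.delim(file, header = FALSE, stringsAsFactors = FALSE) %>%
    filter(grepl("Number of input reads", V1) |
           # matches both "Uniquely mapped reads number" and "Uniquely mapped reads %"
           grepl("Uniquely mapped reads", V1)) %>%
    set_names("Var", "value")
}

# One data frame per log file, named after the file it came from
results <- lapply(files, merge_out)
names(results) <- basename(files)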
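Since the goal is one combined file, here is a rough base-R sketch of how the per-file statistics could be collapsed into a single table with one row per log. The helper name `parse_star_log`, the column names, the demo file name and the output name `star_summary.csv` are all made up for illustration. It assumes every data line has the form `key | value`, as in the sample log in the question, and it runs on a small temporary demo file rather than on real data.

```r
# Hypothetical helper: parse one log into a one-row data frame.
# Assumes each data line looks like "key | value", as in the sample log.
parse_star_log <- function(file) {
  lines <- readLines(file)
  # Section headers such as "UNIQUE READS:" contain no "|", so drop them
  lines <- lines[grepl("|", lines, fixed = TRUE)]
  parts <- strsplit(lines, "|", fixed = TRUE)
  keys  <- trimws(vapply(parts, `[`, character(1), 1))
  vals  <- trimws(vapply(parts, `[`, character(1), 2))
  stats <- setNames(vals, keys)
  data.frame(
    file                = basename(file),
    input_reads         = as.numeric(stats[["Number of input reads"]]),
    uniquely_mapped     = as.numeric(stats[["Uniquely mapped reads number"]]),
    uniquely_mapped_pct = as.numeric(sub("%", "", stats[["Uniquely mapped reads %"]],
                                         fixed = TRUE)),
    stringsAsFactors    = FALSE
  )
}

# Demo: write a few lines copied from the sample log to a temporary directory
demo_dir <- file.path(tempdir(), "star_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c(
  "                          Number of input reads |\t9165655",
  "                                 UNIQUE READS:",
  "                   Uniquely mapped reads number |\t7953458",
  "                        Uniquely mapped reads % |\t86.77%"
), file.path(demo_dir, "sample_L001_Log.final.out"))  # hypothetical file name

# Parse every matching log and stack the rows into one table
files <- list.files(demo_dir, pattern = "Log\\.final\\.out$", full.names = TRUE)
summary_table <- do.call(rbind, lapply(files, parse_star_log))

# Write the combined table to a single CSV file
write.csv(summary_table, file.path(demo_dir, "star_summary.csv"), row.names = FALSE)
```

To use it on real data, point `list.files()` at the directory that holds the logs instead of `demo_dir`. Splitting on `|` and trimming whitespace means it should not matter whether the value is preceded by spaces or a tab.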

Let me know if this helps.