I have a command log file, and I want to extract some of its information into a table. The input looks like this:
####################################################################################################
# Starting pipeline at Mon Jul 29 12:22:56 CEST 2013
# Input files: test.fastq
# Output Log: .bpipe/logs/27790.log
# Stage Results
mkdir ./QC_graphics_results/
####################################################################################################
# Starting pipeline at Mon Jul 29 12:22:57 CEST 2013
# Input files: test.fastq
# Output Log: .bpipe/logs/27790.log
# Stage Statistics_graph_2
fastqc test.fastq -o ./QC_graphics_results/
mv .QC_graphics_results/*fastqc .QC_graphics_results/fastqc
####################################################################################################
# Starting pipeline at Mon Jul 29 12:24:18 CEST 2013
# Input files: test.fastq
# Output Log: .bpipe/logs/27790.log
# Stage GC_content [all]
# Stage Dinucleotide_odds [all]
# Stage Sequence_duplication [all]
prinseq-lite.pl -fastq test.fastq -graph_data test.Dinucleotide_odds.gd -graph_stats dn -out_good null -out_bad null
prinseq-lite.pl -fastq test.fastq -graph_data test.Sequence_duplication.gd -graph_stats da -out_good null -out_bad null
prinseq-lite.pl -fastq test.fastq -graph_data test.GC_content.gd -graph_stats gc -out_good null -out_bad null
The desired output is a table with each stage and its command, like this:
Stage Results mkdir ./QC_graphics_results/
Stage Statistics_graph_2 fastqc test.fastq -o ./QC_graphics_results/
Stage GC_content [all] prinseq-lite.pl -fastq test.fastq -graph_data test.GC_content.gd -graph_stats gc -out_good null -out_bad null
Stage Dinucleotide_odds [all] prinseq-lite.pl -fastq test.fastq -graph_data test.Dinucleotide_odds.gd -graph_stats dn -out_good null -out_bad null
Stage Sequence_duplication [all] prinseq-lite.pl -fastq test.fastq -graph_data test.Sequence_duplication.gd -graph_stats da -out_good null -out_bad null
I have been trying to do this with awk using the code below, but it doesn't work. Any suggestions?
cat commandlog.txt | awk '/^#\ Stage*/{print $0} !/^#.*/{print $0}' | awk '{ if ($0 ~ /^#*/){ if (b=1){next} else {a=$0 b=1 next;} else { if (NF!=0){func=$0 b=0 print $a\t$func\n}}' > ./statistic_files/user_options
Answer 0 (score: 1)
Save this in a file named awk0.
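# Skip blank lines.
NF == 0 {next}
# Skip comment lines that are not stage headers.
substr($1,1,1) == "#" && $2 != "Stage" {next}
# "# Stage NAME": remember the name for the next command line.
$2 == "Stage" && NF == 3 {
    stage_name = $2 " " $3
    next
}
# First command after a single stage header: print the pair.
stage_name != "" {
    print stage_name, $0
    stage_name = ""
    next
}
# "# Stage NAME [all]": collect the name; several such headers can precede the commands.
$2 == "Stage" {
    arr[$3] = ""
    next
}
# Any other command: print it next to each collected stage whose name it mentions.
{
    for (i in arr) {
        if (match($0, i) != 0)
            print "Stage", i, $0
    }
}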
Then run:
cat commandlog.txt | awk -f awk0 > ./statistic_files/user_options
Output:
Stage Results mkdir ./QC_graphics_results/
Stage Statistics_graph_2 fastqc test.fastq -o ./QC_graphics_results/
Stage Dinucleotide_odds prinseq-lite.pl -fastq test.fastq -graph_data test.Dinucleotide_odds.gd -graph_stats dn -out_good null -out_bad null
Stage Sequence_duplication prinseq-lite.pl -fastq test.fastq -graph_data test.Sequence_duplication.gd -graph_stats da -out_good null -out_bad null
Stage GC_content prinseq-lite.pl -fastq test.fastq -graph_data test.GC_content.gd -graph_stats gc -out_good null -out_bad null
Good luck!
Answer 1 (score: 0)
I agree that the question is loosely specified, but for a trivial solution with simple tools, try something like this in bash:
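# Collect every stage name (third field of each "# Stage NAME" line).
for x in $(awk '/Stage /{print $3}' file.txt)
do
    # First look for a command whose -graph_data file is named after the stage.
    g=$(grep "test.$x.gd" file.txt)
    # Otherwise take the non-comment lines between the stage header and the next ## separator.
    test -z "$g" && g=$(awk "/Stage ${x}/,/##/" file.txt | grep -v '#')
    echo -e "Stage $x\t$g"
done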
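It takes the stage names (without spaces) from the paragraphs, then tries to map each one to the line whose -graph_data argument matches it. If no match is found, it takes the lines between the "Stage name" declaration and the start of the next paragraph (assuming paragraphs start with a ## sequence). Should work.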
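The same mapping can also be sketched as a single awk pass, so the file is not re-read once per stage. This is only a sketch under assumptions, not a drop-in replacement: the sample log is a trimmed, hypothetical copy of the one in the question, the variable names (`known`, `current`, `table`) are made up for illustration, and it assumes a -graph_data file is always named `<something>.<stage>.gd`. Unlike the loop above, it prints one row per command line, so a stage with two commands yields two rows.

```shell
#!/usr/bin/env bash
# Hypothetical sample: a trimmed copy of the log from the question.
log=$(mktemp)
cat > "$log" <<'EOF'
####################
# Starting pipeline at Mon Jul 29 12:22:56 CEST 2013
# Input files: test.fastq
# Stage Results
mkdir ./QC_graphics_results/
####################
# Stage GC_content [all]
# Stage Dinucleotide_odds [all]
prinseq-lite.pl -fastq test.fastq -graph_data test.Dinucleotide_odds.gd -graph_stats dn -out_good null -out_bad null
prinseq-lite.pl -fastq test.fastq -graph_data test.GC_content.gd -graph_stats gc -out_good null -out_bad null
EOF

table=$(awk '
    # "# Stage NAME ...": remember NAME as a known stage and as the current one.
    $1 == "#" && $2 == "Stage" { known[$3] = 1; current = $3; next }
    # Skip blank lines and every other comment line.
    NF == 0 || /^#/ { next }
    # Command line: default to the most recent stage header, but prefer a
    # known stage whose name appears as ".NAME.gd" in the command.
    {
        stage = current
        for (name in known)
            if (index($0, "." name ".gd") > 0)
                stage = name
        printf "Stage %s\t%s\n", stage, $0
    }
' "$log")

printf '%s\n' "$table"
rm -f "$log"
```

On this sample it prints three tab-separated rows, for Results, Dinucleotide_odds and GC_content, in the order the commands appear in the log. To run it on a real log, pass that file to awk instead of `"$log"`.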