I am working on a script and am having trouble formatting its output. The index and the input files look like this:
index
Pseudopropionibacterium propionicum
Kibdelosporangium phytohabitans
Steroidobacter denitrificans
File 1
Pseudopropionibacterium propionicum 1591.0
Kibdelosporangium phytohabitans 907.0
Olsenella sp. oral taxon 807 7323.0 oral bacterium
Steroidobacter denitrificans 6673.0 sludge bacterium
File 2
Pseudopropionibacterium propionicum 123.0
Caulobacteraceae bacterium OTSz_A_272 1019.0
Saccharopolyspora erythraea 939.0 soil bacterium
Rhodopseudomonas palustris 900.0
Nitrospira moscoviensis 856.0 soil/water bacterium
File 3
Pseudopropionibacterium propionicum 1591.0
Kibdelosporangium phytohabitans 907.0
Verrucosispora maris 391.0 deep-sea actinomycete
Tannerella forsythia 389.0 periodontal pathogen
Actinoplanes missouriensis 376.0 soil bacterium
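The script uses the index to look for matches in each input file and prints the first and second fields of every match. There are several such input files (all with the same layout), and I want the output for each one to go into its own column.
My code so far: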
#!/bin/bash
for file in ./*_TOP1000
do
    basename "$file" >> output
    awk 'BEGIN{FS="\t"}NR==FNR{a[$1]=$0;next}$1 in a{print $1,$2}' index "$file" >> output
done
The output looks like this:
File 1
Pseudopropionibacterium propionicum 1591.0
Kibdelosporangium phytohabitans 907.0
Steroidobacter denitrificans 6673.0
File 2
Pseudopropionibacterium propionicum 4326.0
File 3
Kibdelosporangium phytohabitans 1591.0
Pseudopropionibacterium propionicum 907.0
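But I would like it laid out like this:
File 1                                      File 2                                      File 3
Pseudopropionibacterium propionicum 1591.0  Pseudopropionibacterium propionicum 4326.0  Pseudopropionibacterium propionicum 907.0
Kibdelosporangium phytohabitans 907.0                                                   Kibdelosporangium phytohabitans 1591.0
Steroidobacter denitrificans 6673.0
with the matches listed directly below each file name. Every file can have different matches.
I tried to solve it by sneaking in a delimiter and using the column command, but that did not work properly. So how can I achieve the desired output?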
Answer 0 (score: 2)
$ cat tst.awk
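BEGIN { OFS="\t" }
# with the default FS, $1 is the first whitespace-separated word (the genus)
NR==FNR { indices[$1]; next }                   # first file: remember each index key
FNR==1 { filenames[++numCols] = FILENAME }      # each further file starts a new column
$1 in indices {
    vals[numCols,++rowCnt[numCols]] = $1 FS $2 FS $3    # next free row of this column
    numRows = (rowCnt[numCols] > numRows ? rowCnt[numCols] : numRows)
}
END {
    # header row: the file names
    for (colNr=1; colNr<=numCols; colNr++) {
        printf "%s%s", filenames[colNr], (colNr<numCols ? OFS : ORS)
    }
    # data rows: one cell per file, empty where a file has fewer matches
    for (rowNr=1; rowNr<=numRows; rowNr++) {
        for (colNr=1; colNr<=numCols; colNr++) {
            printf "%s%s", vals[colNr,rowNr], (colNr<numCols ? OFS : ORS)
        }
    }
}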
$ awk -f tst.awk index file1 file2 file3 | column -s$'\t' -t
file1 file2 file3
Pseudopropionibacterium propionicum 1591.0 Pseudopropionibacterium propionicum 123.0 Pseudopropionibacterium propionicum 1591.0
Kibdelosporangium phytohabitans 907.0 Kibdelosporangium phytohabitans 907.0
Steroidobacter denitrificans 6673.0
The pipe to column is only there to display the output in aligned columns instead of tab-separated.
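If the data files are tab-separated, as the FS="\t" in the question suggests, the first field is the full species name and the lookup can be done on that name rather than on the first whitespace-separated word. The following is a minimal sketch under that assumption, not the answerer's script: the temporary directory, the shortened sample data and the file names index, file1 and file2 are made up for illustration, and only POSIX awk features are used.

```shell
#!/bin/sh
# Sketch only: assumes tab-separated data files with the species name in field 1.
# The directory and the shortened sample data below are illustrative.
dir=$(mktemp -d)
printf '%s\n' 'Pseudopropionibacterium propionicum' \
              'Kibdelosporangium phytohabitans' > "$dir/index"
printf '%s\t%s\n' 'Pseudopropionibacterium propionicum' 1591.0 \
                  'Olsenella sp. oral taxon 807' 7323.0 \
                  'Kibdelosporangium phytohabitans' 907.0 > "$dir/file1"
printf '%s\t%s\n' 'Pseudopropionibacterium propionicum' 123.0 \
                  'Rhodopseudomonas palustris' 900.0 > "$dir/file2"

result=$(cd "$dir" && awk '
BEGIN { FS = OFS = "\t" }
NR == FNR { indices[$1]; next }                    # index file: one full name per line
FNR == 1  { filenames[++numCols] = FILENAME }      # a new data file opens a new column
$1 in indices {
    vals[numCols, ++rowCnt[numCols]] = $1 " " $2   # one cell: name, space, value
    if (rowCnt[numCols] > numRows) numRows = rowCnt[numCols]
}
END {
    for (c = 1; c <= numCols; c++)                 # header row: file names
        printf "%s%s", filenames[c], (c < numCols ? OFS : ORS)
    for (r = 1; r <= numRows; r++)                 # data rows, empty cell where no match
        for (c = 1; c <= numCols; c++)
            printf "%s%s", vals[c, r], (c < numCols ? OFS : ORS)
}' index file1 file2)
rm -rf "$dir"
printf '%s\n' "$result"
```

The result is tab-separated and can be piped through column -s "$(printf '\t')" -t for display. Depending on the column implementation, adjacent delimiters may be merged, in which case empty cells can shift to the left; writing a placeholder such as a single space into empty cells avoids that.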
Answer 1 (score: 1)
Something like this, in GNU awk because of the third argument to match:
$ awk '
NR==FNR { a[$0]; next }                  # read and hash index file to a
FNR==1  { print FILENAME }               # print filename at start of data files
{
    match($0,/^([^0-9]+)([0-9.]+)/,b)    # get the name part and first value
    gsub(/^ +| +$/,"",b[1])              # trim name
    if(b[1] in a)                        # print indexed
        print b[1],b[2]
}' index file1 file1
file1
Pseudopropionibacterium propionicum 4326.0
Kibdelosporangium phytohabitans 3819.0
file1
Pseudopropionibacterium propionicum 4326.0
Kibdelosporangium phytohabitans 3819.0
The column-wise version also needs GNU awk, because of the 2D arrays (arrays of arrays):
$ cat program.awk
NR==FNR { a[$0]; next }                  # read and hash index file to a
FNR==1  { c[++i][j=1]=FILENAME }         # store filename as the first row of column i
{
    match($0,/^([^0-9]+)([0-9.]+)/,b)    # get the name part and first value
    gsub(/^ +| +$/,"",b[1])              # trim name
    if(b[1] in a) {                      # keep indexed entries only
        c[i][++j]=b[1] OFS b[2]
        if(m<j||m=="") m=j               # max row count over all files
        if(l[i]<=length(b[1] OFS b[2])||l[i]=="")
            l[i]=length(b[1] OFS b[2])   # widest entry of file i, used as printf width
    }
}
END {
    for(k=1;k<=m;k++)                    # rows
        for(j=1;j<=i;j++)                # columns, one per data file
            printf "%-" l[j] "s %s", c[j][k], (j==i?ORS:OFS)
}
Test it:
$ awk -f program.awk index file1 file2 file3
file1 file2 file3
Pseudopropionibacterium propionicum 1591.0 Pseudopropionibacterium propionicum 123.0 Pseudopropionibacterium propionicum 1591.0
Kibdelosporangium phytohabitans 907.0 Kibdelosporangium phytohabitans 907.0
Steroidobacter denitrificans 6673.0
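For awk implementations without the three-argument match or arrays of arrays, the same padding idea can be approximated with RSTART/RLENGTH and comma-subscripted arrays. This is only a sketch under the same assumption as above (the name is everything before the first digit on the line); the variable names, the temporary directory and the shortened sample data are illustrative and not part of the answer.

```shell
#!/bin/sh
# Sketch only: a POSIX awk take on the padded layout (no gawk-only features).
# Sample data and file names are illustrative.
dir=$(mktemp -d)
printf '%s\n' 'Pseudopropionibacterium propionicum' \
              'Kibdelosporangium phytohabitans' > "$dir/index"
printf '%s\n' 'Pseudopropionibacterium propionicum 1591.0' \
              'Kibdelosporangium phytohabitans 907.0' > "$dir/file1"
printf '%s\n' 'Pseudopropionibacterium propionicum 123.0' \
              'Nitrospira moscoviensis 856.0 soil/water bacterium' > "$dir/file2"

padded=$(cd "$dir" && awk '
NR == FNR { wanted[$0]; next }                # index file: whole line is the name
FNR == 1  {                                   # new data file: header cell of a new column
    cell[++nfiles, 1] = FILENAME
    rows[nfiles] = 1
    width[nfiles] = length(FILENAME)
    if (maxRows < 1) maxRows = 1
}
match($0, /^[^0-9]+/) {
    name = substr($0, 1, RLENGTH)             # everything before the first digit
    rest = substr($0, RLENGTH + 1)
    gsub(/^[ \t]+|[ \t]+$/, "", name)         # trim the name
    if ((name in wanted) && match(rest, /^[0-9.]+/)) {
        entry = name " " substr(rest, 1, RLENGTH)
        cell[nfiles, ++rows[nfiles]] = entry
        if (length(entry) > width[nfiles]) width[nfiles] = length(entry)
        if (rows[nfiles] > maxRows) maxRows = rows[nfiles]
    }
}
END {
    for (r = 1; r <= maxRows; r++) {
        line = ""
        for (f = 1; f <= nfiles; f++)         # pad each cell to the width of its column
            line = line sprintf("%-" width[f] "s  ", cell[f, r])
        sub(/ +$/, "", line)                  # drop trailing padding
        print line
    }
}' index file1 file2)
rm -rf "$dir"
printf '%s\n' "$padded"
```

Because each cell is padded to the widest entry of its own column, empty cells keep their place and no external column call is needed.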
Answer 2 (score: 1)
Rearranging the table is probably easier with Perl than with Awk.
If you feed the data to column in the right order, it formats the columns correctly. Use the option -t and specify the column separator with -s.
#! /usr/bin/perl
use strict;
use warnings;
my $table;      # two-dimensional table: $table->[row]->[col]
my $col = -1;
my $row = 0;
while (<DATA>)  # loop through the input line by line
{
    chomp;                                  # remove end of line
    if (/^File/) { $col++; $row = 0; }      # new column, restart at row 0, when a line starts with File
    $table->[$row++]->[$col] = $_;          # store the line in the table and move to the next row
}
open (my $out, '|-', "column -s ^ -t") or die "Cannot run column: $!";  # open pipe to column
foreach (@$table)                           # loop over the rows of the table
{
    # join the cells of a row with the delimiter ^ and replace undefined values with a space
    print $out join('^', map { $_ or ' ' } @$_), "\n";
}
close $out;
__DATA__
File 1
Pseudopropionibacterium propionicum 1591.0
Kibdelosporangium phytohabitans 907.0
File 2
Pseudopropionibacterium propionicum 4326.0
File 3
Kibdelosporangium phytohabitans 2019.0
Pseudopropionibacterium propionicum 1542.0
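This prints the columns like this:
File 1                                      File 2                                      File 3
Pseudopropionibacterium propionicum 1591.0  Pseudopropionibacterium propionicum 4326.0  Kibdelosporangium phytohabitans 2019.0
Kibdelosporangium phytohabitans 907.0                                                   Pseudopropionibacterium propionicum 1542.0
If you want to read standard input instead of Perl's DATA section, change <DATA> to <STDIN> (or to <>, which also reads files named on the command line).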
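For comparison, the same block-to-column rearrangement can be sketched in POSIX awk, reading the block-style output on standard input. This is an illustration rather than a replacement for the Perl script: the function name to_columns and the variable names are made up, and it assumes, like the Perl version, that every block starts with a line beginning with "File".

```shell
#!/bin/sh
# Sketch only: the block-to-column rearrangement in POSIX awk, reading standard input.
# Assumes every block starts with a line beginning with "File", as in the sample data.
to_columns() {
    awk '
    /^File/ { col++; row = 0 }                 # a header line opens a new column
    col > 0 {
        cell[++row, col] = $0                  # store the line in the next row of this column
        if (row > maxRow) maxRow = row
    }
    END {
        for (r = 1; r <= maxRow; r++) {
            line = ""
            for (c = 1; c <= col; c++) {
                v = ((r, c) in cell) ? cell[r, c] : " "   # a space marks an empty cell
                sep = (c > 1) ? "^" : ""
                line = line sep v
            }
            print line                         # cells of one row joined with ^
        }
    }'
}

sample='File 1
Pseudopropionibacterium propionicum 1591.0
Kibdelosporangium phytohabitans 907.0
File 2
Pseudopropionibacterium propionicum 4326.0
File 3
Kibdelosporangium phytohabitans 2019.0
Pseudopropionibacterium propionicum 1542.0'

rearranged=$(printf '%s\n' "$sample" | to_columns)
printf '%s\n' "$rearranged"
```

With the output file produced by the loop in the question, this would be used as to_columns < output | column -s '^' -t, assuming none of the data contains a ^ character.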