Iteratively generating new files based on the lines of another file

Time: 2019-06-03 17:53:26

Tags: linux shell loops awk extract

I have two different files from which I want to extract some lines and generate new files. My first file is file1.tsv:

A       B       C       D       E       Example  Set     Group
0       0       27      0       0       exA sub9    1
0       0       45      12      12      exA sub14   0
1       1       45      14      6       exA sub6    0
2       2       65      7       8       exA sub2    1
3       3       68      9       14      exA sub13   0
4       4       70      8       13      exA sub5    0
5       5       75      3       11      exA sub8    1
6       6       79      10      7       exA sub7    1
7       7       85      13      5       exA sub12   1
8       8       88      5       4       exA sub1    0
9       9       90      1       1       exA sub10   1
10      10      92      2       2       exA sub3    0
11      11      98      4       3       exA sub4    1
12      12      108     12      10      exA sub11   1

My second file is a vector file, file2.vec:

1 1:3.000 2:0.000 3:0.000 4:4.000 5:0.000 #(Aid=sub1, Bid=exA, group=1)
2 1:0.000 2:1.000 3:2.000 4:5.000 5:0.000 #(Aid=sub2, Bid=exA, group=2)
1 1:2.000 2:3.000 3:0.000 4:0.000 5:0.000 #(Aid=sub3, Bid=exA, group=1)
2 1:0.000 2:5.000 3:1.000 4:2.000 5:0.000 #(Aid=sub4, Bid=exA, group=2)
1 1:0.000 2:1.000 3:1.000 4:2.000 5:0.000 #(Aid=sub5, Bid=exA, group=1)
1 1:5.000 2:0.000 3:1.000 4:3.000 5:0.000 #(Aid=sub6, Bid=exA, group=1)
2 1:1.000 2:0.000 3:1.000 4:1.000 5:0.000 #(Aid=sub7, Bid=exA, group=2)
1 1:4.000 2:2.000 3:0.000 4:1.000 5:0.000 #(Aid=sub8, Bid=exA, group=1)
2 1:0.000 2:1.000 3:0.000 4:4.000 5:0.000 #(Aid=sub9, Bid=exA, group=2)
2 1:0.000 2:0.000 3:1.000 4:0.000 5:0.000 #(Aid=sub10, Bid=exA, group=2)
2 1:4.000 2:2.000 3:1.000 4:2.000 5:0.000 #(Aid=sub11, Bid=exA, group=2)
2 1:0.000 2:4.000 3:1.000 4:2.000 5:0.000 #(Aid=sub12, Bid=exA, group=2)
1 1:4.000 2:2.000 3:1.000 4:0.000 5:0.000 #(Aid=sub13, Bid=exA, group=1)
1 1:2.000 2:0.000 3:1.000 4:1.000 5:0.000 #(Aid=sub14, Bid=exA, group=1)

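I want to use the data in column 7 of file1.tsv (header: Set) to generate new files that contain the corresponding lines from file2.vec, with each iteration adding one more line to the previous output. For example, the first row of file1.tsv (not counting the header) is sub9. The matching data in file2.vec can be found through its Aid, so the output is: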

out1.vec 
2 1:0.000 2:1.000 3:0.000 4:4.000 5:0.000

I now want several outputs like this:

out2.vec 
2 1:0.000 2:1.000 3:0.000 4:4.000 5:0.000
1 1:2.000 2:0.000 3:1.000 4:1.000 5:0.000

out3.vec 
2 1:0.000 2:1.000 3:0.000 4:4.000 5:0.000
1 1:2.000 2:0.000 3:1.000 4:1.000 5:0.000
1 1:5.000 2:0.000 3:1.000 4:3.000 5:0.000

...
out4-13

out14.vec 
2 1:0.000 2:1.000 3:0.000 4:4.000 5:0.000
1 1:2.000 2:0.000 3:1.000 4:1.000 5:0.000
1 1:5.000 2:0.000 3:1.000 4:3.000 5:0.000
2 1:0.000 2:1.000 3:2.000 4:5.000 5:0.000
1 1:4.000 2:2.000 3:1.000 4:0.000 5:0.000
1 1:0.000 2:1.000 3:1.000 4:2.000 5:0.000
1 1:4.000 2:2.000 3:0.000 4:1.000 5:0.000
2 1:1.000 2:0.000 3:1.000 4:1.000 5:0.000
2 1:0.000 2:4.000 3:1.000 4:2.000 5:0.000
1 1:3.000 2:0.000 3:0.000 4:4.000 5:0.000
2 1:0.000 2:0.000 3:1.000 4:0.000 5:0.000
1 1:2.000 2:3.000 3:0.000 4:0.000 5:0.000
2 1:0.000 2:5.000 3:1.000 4:2.000 5:0.000
2 1:4.000 2:2.000 3:1.000 4:2.000 5:0.000

I have a directory with multiple files like file1.tsv, and I want to perform the steps above for each of them. So I tried to write a shell script:

# first to extract column 7 
for filename in File; do
        listFile=$(basename "$filename" .tsv)-cmpdsList.tsv
        awk '{if (NR!=1) {print $7}}' $filename \
        > $listFile
done

# second to generate files containing lines from previously generated list
for line in $(cat $listFile); do
        echo "$line" > $line.vec
done

# add information corresponding to the compounds to generate vector file
for file in $line.vec; do
        output=$(basename "$line.vec" .vec)-output.vec
        gawk 'BEGIN {RS="\n"; ORS="\n"} (NR==FNR){a[$1]=$0; next} ($1 in a){print a[$1]}' $file RS="\n" $line.vec > $output
 done

But it only generates empty vector files. Thanks!

1 answer:

Answer 0: (score: 0)

First, some comments on your code:

# first to extract column 7 
for filename in File; do
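  • File is a literal string, so this loop runs exactly once with filename set to "File". Perhaps you meant to loop over files here, for example with a glob such as *.tsv?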
        listFile=$(basename "$filename" .tsv)-cmpdsList.tsv
        awk '{if (NR!=1) {print $7}}' $filename \
        > $listFile
  • The awk command can be simplified to: awk 'NR>1 {print $7}' ...
  • It is safer to quote $listFile wherever it is used: "$listFile"
done

# second to generate files containing lines from previously generated list
for line in $(cat $listFile); do
        echo "$line" > $line.vec
done
  • If $line is sub9 or something similar, the resulting file name cannot also be out1
# add information corresponding to the compounds to generate vector file
for file in $line.svm; do
        output=$(basename "$line.svm" .svm)-output.vec
        gawk 'BEGIN {RS="\n"; ORS="\n"} (NR==FNR){a[$1]=$0; next} ($1 in a){print a[$1]}' $file RS="\n" $line.svm > $output
 done
  • Your example does not contain any .svm files, so it is hard to interpret this code
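
As a minimal sketch of the first comments above, the extraction loop could iterate over a glob instead of the literal word File and quote its variables. The ./*.tsv pattern and the skipping of previously generated list files are assumptions about the directory layout, and the small sample file is created only so that the snippet runs on its own:

```shell
#!/bin/sh
# Sample input so the sketch is runnable (header plus first two rows of file1.tsv).
printf 'A\tB\tC\tD\tE\tExample\tSet\tGroup\n' > file1.tsv
printf '0\t0\t27\t0\t0\texA\tsub9\t1\n' >> file1.tsv
printf '0\t0\t45\t12\t12\texA\tsub14\t0\n' >> file1.tsv

# Loop over a glob (assumption: all inputs are ./*.tsv) rather than a fixed word.
for filename in ./*.tsv; do
    [ -e "$filename" ] || continue                        # glob matched nothing
    case $filename in *-cmpdsList.tsv) continue ;; esac   # skip lists from earlier runs
    listFile=$(basename "$filename" .tsv)-cmpdsList.tsv
    # Print column 7 of every row except the header.
    awk 'NR>1 {print $7}' "$filename" > "$listFile"
done
```

With the sample above this writes file1-cmpdsList.tsv containing sub9 and sub14, one per line.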

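A possible awk solution

Based on your file1.tsv and file2.vec (assuming the 1 missing at the beginning of the first line is a typo) and on your description of the output, a possible awk solution is: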

awk '
    NR==FNR && NR>1 {
        n++
        aid[$7] = n
        next
    }
    NR!=FNR {
        pat = $7
        sub(/^#[(]Aid=/, "", pat)
        sub(/,$/, "", pat)
        sub(/ #.*$/, "", $0)
        line[ aid[pat] ] = $0
    }
    END {
        for (i=1; i<=n; i++) {
            out = "out" i ".vec"
            printf "" > out
            for (j=1; j<=i; j++) {
                print line[j] >> out
            }
            close(out)
        }
    }
' file1.tsv file2.vec
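  • NR==FNR ... - extract the ids from file1.tsv and map each one to its row number
  • NR!=FNR ... - extract the id from $7 of each file2.vec line, remove the trailing comment fields from $0 and store the line under that row number
  • END ... - for each row, write its line and all preceding lines to the appropriate output file
  • close - close each file after writing to avoid running out of file descriptors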
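
The question also mentions a directory holding several files like file1.tsv. One possible way to apply the same idea to each of them is to wrap the awk program in a loop and pass a per-file prefix with -v, so that outputs from different inputs do not overwrite each other. This is only a sketch: the naming scheme (file1-out1.vec, ...), the ./*.tsv glob and the use of one shared file2.vec are assumptions that are not part of the answer above, and the tiny sample files exist only to make the snippet runnable:

```shell
#!/bin/sh
# Tiny sample inputs (a subset of the data in the question).
printf 'A\tB\tC\tD\tE\tExample\tSet\tGroup\n0\t0\t27\t0\t0\texA\tsub9\t1\n0\t0\t45\t12\t12\texA\tsub14\t0\n' > file1.tsv
cat > file2.vec <<'EOF'
2 1:0.000 2:1.000 3:0.000 4:4.000 5:0.000 #(Aid=sub9, Bid=exA, group=2)
1 1:2.000 2:0.000 3:1.000 4:1.000 5:0.000 #(Aid=sub14, Bid=exA, group=1)
EOF

for tsv in ./*.tsv; do
    [ -e "$tsv" ] || continue
    prefix=$(basename "$tsv" .tsv)        # hypothetical output prefix
    awk -v prefix="$prefix" '
        # First file: map each id in column 7 to its row number (header skipped).
        NR==FNR { if (FNR>1) aid[$7] = ++n; next }
        # Second file: pull the id out of "#(Aid=...," and keep the line without the comment.
        {
            pat = $7
            sub(/^#[(]Aid=/, "", pat); sub(/,$/, "", pat)
            if (!(pat in aid)) next       # id not listed in this tsv
            sub(/ #.*$/, "")
            line[aid[pat]] = $0
        }
        # Write cumulative outputs: file i holds lines 1..i.
        END {
            for (i = 1; i <= n; i++) {
                out = prefix "-out" i ".vec"
                for (j = 1; j <= i; j++) print line[j] > out
                close(out)
            }
        }
    ' "$tsv" file2.vec
done
```

Inside awk, the first print with > truncates the file and later prints to the same name keep appending until close is called, so no separate >> is needed in this variant.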