I have two different files from which I want to extract some lines and generate new files. My first file is file1.tsv:
A B C D E Example Set Group
0 0 27 0 0 exA sub9 1
0 0 45 12 12 exA sub14 0
1 1 45 14 6 exA sub6 0
2 2 65 7 8 exA sub2 1
3 3 68 9 14 exA sub13 0
4 4 70 8 13 exA sub5 0
5 5 75 3 11 exA sub8 1
6 6 79 10 7 exA sub7 1
7 7 85 13 5 exA sub12 1
8 8 88 5 4 exA sub1 0
9 9 90 1 1 exA sub10 1
10 10 92 2 2 exA sub3 0
11 11 98 4 3 exA sub4 1
12 12 108 12 10 exA sub11 1
My second file is a vector file, file2.vec:
1 1:3.000 2:0.000 3:0.000 4:4.000 5:0.000 #(Aid=sub1, Bid=exA, group=1)
2 1:0.000 2:1.000 3:2.000 4:5.000 5:0.000 #(Aid=sub2, Bid=exA, group=2)
1 1:2.000 2:3.000 3:0.000 4:0.000 5:0.000 #(Aid=sub3, Bid=exA, group=1)
2 1:0.000 2:5.000 3:1.000 4:2.000 5:0.000 #(Aid=sub4, Bid=exA, group=2)
1 1:0.000 2:1.000 3:1.000 4:2.000 5:0.000 #(Aid=sub5, Bid=exA, group=1)
1 1:5.000 2:0.000 3:1.000 4:3.000 5:0.000 #(Aid=sub6, Bid=exA, group=1)
2 1:1.000 2:0.000 3:1.000 4:1.000 5:0.000 #(Aid=sub7, Bid=exA, group=2)
1 1:4.000 2:2.000 3:0.000 4:1.000 5:0.000 #(Aid=sub8, Bid=exA, group=1)
2 1:0.000 2:1.000 3:0.000 4:4.000 5:0.000 #(Aid=sub9, Bid=exA, group=2)
2 1:0.000 2:0.000 3:1.000 4:0.000 5:0.000 #(Aid=sub10, Bid=exA, group=2)
2 1:4.000 2:2.000 3:1.000 4:2.000 5:0.000 #(Aid=sub11, Bid=exA, group=2)
2 1:0.000 2:4.000 3:1.000 4:2.000 5:0.000 #(Aid=sub12, Bid=exA, group=2)
1 1:4.000 2:2.000 3:1.000 4:0.000 5:0.000 #(Aid=sub13, Bid=exA, group=1)
1 1:2.000 2:0.000 3:1.000 4:1.000 5:0.000 #(Aid=sub14, Bid=exA, group=1)
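I want to use the data in column 7 of file1.tsv (header: Set) to generate new files containing the corresponding lines from file2.vec. With each iteration, I want to add one more line to the previous output. For example, the first row of file1.tsv (not counting the header) is sub9, and the corresponding data in file2.vec can be found through Aid, so the output is: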
out1.vec
2 1:0.000 2:1.000 3:0.000 4:4.000 5:0.000
I then want several such outputs, each one line longer than the last:
out2.vec
2 1:0.000 2:1.000 3:0.000 4:4.000 5:0.000
1 1:2.000 2:0.000 3:1.000 4:1.000 5:0.000
out3.vec
2 1:0.000 2:1.000 3:0.000 4:4.000 5:0.000
1 1:2.000 2:0.000 3:1.000 4:1.000 5:0.000
1 1:5.000 2:0.000 3:1.000 4:3.000 5:0.000
...
(out4.vec through out13.vec follow the same pattern)
out14.vec
2 1:0.000 2:1.000 3:0.000 4:4.000 5:0.000
1 1:2.000 2:0.000 3:1.000 4:1.000 5:0.000
1 1:5.000 2:0.000 3:1.000 4:3.000 5:0.000
2 1:0.000 2:1.000 3:2.000 4:5.000 5:0.000
1 1:4.000 2:2.000 3:1.000 4:0.000 5:0.000
1 1:0.000 2:1.000 3:1.000 4:2.000 5:0.000
1 1:4.000 2:2.000 3:0.000 4:1.000 5:0.000
2 1:1.000 2:0.000 3:1.000 4:1.000 5:0.000
2 1:0.000 2:4.000 3:1.000 4:2.000 5:0.000
1 1:3.000 2:0.000 3:0.000 4:4.000 5:0.000
2 1:0.000 2:0.000 3:1.000 4:0.000 5:0.000
1 1:2.000 2:3.000 3:0.000 4:0.000 5:0.000
2 1:0.000 2:5.000 3:1.000 4:2.000 5:0.000
2 1:4.000 2:2.000 3:1.000 4:2.000 5:0.000
I have a directory containing multiple files like file1.tsv, and I want to perform the steps above for each of them. So I tried to write a shell script:
# first to extract column 7
for filename in File; do
listFile=$(basename "$filename" .tsv)-cmpdsList.tsv
awk '{if (NR!=1) {print $7}}' $filename \
> $listFile
done
# second to generate files containing lines from previously generated list
for line in $(cat $listFile); do
echo "$line" > $line.vec
done
# add information corresponding to the compounds to generate vector file
for file in $line.vec; do
output=$(basename "$line.vec" .vec)-output.vec
gawk 'BEGIN {RS="\n"; ORS="\n"} (NR==FNR){a[$1]=$0; next} ($1 in a){print a[$1]}' $file RS="\n" $line.vec > $output
done
But it only generates empty vector files. Thanks!
Answer 0 (score: 0)
First, some comments on your code:
# first to extract column 7
for filename in File; do
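File is a literal string here, so the loop body runs exactly once with filename set to File. Perhaps you meant to list the files here, for example with a glob such as *.tsv?
listFile=$(basename "$filename" .tsv)-cmpdsList.tsv
awk '{if (NR!=1) {print $7}}' $filename \
> $listFile
The awk command can be written more simply as: awk 'NR>1 {print $7}' ...
Quoting $listFile is safer: "$listFile"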
done
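A sketch of how this first loop might look with a glob and quoted variables. It assumes the input files are all the *.tsv files in the current directory; the function name extract_sets is hypothetical and only there so the loop can be called on demand.

```shell
# Hypothetical helper: for every .tsv file in the current directory, write
# column 7 (without the header line) to <name>-cmpdsList.tsv.
extract_sets() {
    for filename in ./*.tsv; do
        [ -e "$filename" ] || continue      # the glob matched nothing
        case $filename in
            *-cmpdsList.tsv) continue ;;    # skip lists left by an earlier run
        esac
        listFile=$(basename "$filename" .tsv)-cmpdsList.tsv
        awk 'NR>1 {print $7}' "$filename" > "$listFile"
    done
}
```

Calling extract_sets in the directory holding file1.tsv should produce file1-cmpdsList.tsv with one Set value per line. Because the list files also end in .tsv, the case statement keeps a second run from treating them as input.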
# second to generate files containing lines from previously generated list
for line in $(cat $listFile); do
echo "$line" > $line.vec
done
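If $line is sub9 or some similar name, then it cannot also be out1, so this loop never creates the out1.vec, out2.vec, ... files you describe.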
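To get numbered outputs, the loop needs its own counter instead of reusing the ID as the file name. A minimal shell-only sketch, assuming the list file holds one ID per line and that every ID appears in the vector file as Aid=<id>, (with the trailing comma); the function name number_outputs is hypothetical.

```shell
# Hypothetical helper: number_outputs LISTFILE VECFILE
# Writes out1.vec, out2.vec, ... where outN.vec holds the vector lines
# for the first N IDs in LISTFILE, with the trailing "#(...)" comment removed.
number_outputs() {
    listFile=$1
    vec=$2
    i=0
    acc=$(mktemp)                           # accumulates the lines seen so far
    while IFS= read -r id; do
        i=$((i + 1))
        # -F matches literally; the comma stops sub1 from matching sub10
        grep -F "Aid=$id," "$vec" | sed 's/ #.*$//' >> "$acc"
        cp "$acc" "out$i.vec"
    done < "$listFile"
    rm -f "$acc"
}
```

This rereads the vector file once per ID, which is fine for small inputs but slower than a single awk pass over both files.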
# add information corresponding to the compounds to generate vector file
for file in $line.svm; do
output=$(basename "$line.svm" .svm)-output.vec
gawk 'BEGIN {RS="\n"; ORS="\n"} (NR==FNR){a[$1]=$0; next} ($1 in a){print a[$1]}' $file RS="\n" $line.svm > $output
done
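No .svm files are mentioned anywhere, so this code is hard to interpret.
Based on your file1.tsv and file2.vec (assuming the missing 1 at the start of the first line is a typo) and your description of the output, a possible awk solution is: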
awk '
    NR==FNR && NR>1 {
        n++
        aid[$7] = n
        next
    }
    NR!=FNR {
        pat = $7
        sub(/^#[(]Aid=/, "", pat)
        sub(/,$/, "", pat)
        sub(/ #.*$/, "", $0)
        line[ aid[pat] ] = $0
    }
    END {
        for (i=1; i<=n; i++) {
            out = "out" i ".vec"
            printf "" > out
            for (j=1; j<=i; j++) {
                print line[j] >> out
            }
            close(out)
        }
    }
' file1.tsv file2.vec
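NR==FNR ... - extract the IDs from file1.tsv and map each one to its row number (the header is skipped)
NR!=FNR ... - work out the row number from the ID in $7, delete the trailing comment field from $0, and store the line
END ... - for each row i, write lines 1 through i to the corresponding output file
close - close each file after writing, to avoid running out of file descriptors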
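Since the question mentions a whole directory of files like file1.tsv, the same idea could be wrapped in a loop. The sketch below is one way to do that, not part of the solution above. It assumes every .tsv file is looked up in the same vector file, and the function name make_cumulative and the <name>-outN.vec naming are hypothetical choices to keep outputs from different inputs apart.

```shell
# Hypothetical wrapper: make_cumulative VECFILE TSVFILE...
# For each TSVFILE, writes <name>-out1.vec, <name>-out2.vec, ... where
# <name>-outN.vec holds the vector lines for the first N rows of TSVFILE.
make_cumulative() {
    vec=$1
    shift
    for tsv in "$@"; do
        prefix=$(basename "$tsv" .tsv)
        awk -v prefix="$prefix" '
            # first file: remember the row number of each ID in column 7
            NR==FNR { if (FNR > 1) aid[$7] = ++n; next }
            # second file: pull the ID out of "#(Aid=<id>," and store the line
            {
                id = $7
                sub(/^#[(]Aid=/, "", id)
                sub(/,$/, "", id)
                sub(/ #.*$/, "")            # drop the trailing comment from $0
                if (id in aid) line[aid[id]] = $0
            }
            END {
                for (i = 1; i <= n; i++) {
                    out = prefix "-out" i ".vec"
                    for (j = 1; j <= i; j++) print line[j] > out
                    close(out)
                }
            }
        ' "$tsv" "$vec"
    done
}
```

For example, make_cumulative file2.vec ./*.tsv would process every .tsv file in the current directory. IDs in the vector file that do not occur in the .tsv file are ignored; an ID that occurs only in the .tsv file leaves an empty line in the output.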