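I have an SDFile containing thousands of molecules, and I need to extract molecules from it based on the IDs listed in a simple single-column file. An example of the SDF, file1.sdf: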
MOL108108
-Chem-8567890432
15 15 0 0 0 0 0 0 0999 V2000
6.1792 -2.6875 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
6.9542 -2.6875 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.4125 -2.7167 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.1667 -3.4667 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.1667 -1.9000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.7375 -3.4625 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.1000 -2.7667 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
3.1500 -4.1292 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.0542 -3.3792 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.0167 -2.0542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.8792 -2.7542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.2542 -3.7125 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.2500 -2.0792 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2875 -3.4042 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1.9542 -3.4875 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 1 0 0 0 0
1 4 2 0 0 0 0
1 5 2 0 0 0 0
6 7 1 0 0 0 0
7 11 1 0 0 0 0
6 8 2 0 0 0 0
3 9 1 0 0 0 0
3 10 2 0 0 0 0
11 13 2 0 0 0 0
2 12 1 0 0 0 0
10 13 1 0 0 0 0
9 14 2 0 0 0 0
6 15 1 0 0 0 0
11 14 1 0 0 0 0
M END
> <mol_id>
MOL108108
$$$$
MOL16520
-Chem4051902312
22 21 0 1 0 0 0 0 0999 V2000
0.2750 0.1500 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-0.2458 -0.1500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7917 -0.1500 0.0000 C 0 0 2 0 0 0 0 0 0 0 0 0
1.3167 0.1458 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.7625 0.1583 0.0000 C 0 0 1 0 0 0 0 0 0 0 0 0
-1.8083 0.1583 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7917 -0.7500 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0
-1.2833 -0.1417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.2458 -0.7500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.3167 0.7458 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-0.7625 0.7583 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.8000 0.7583 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
1.8292 -0.1542 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-2.3208 -0.1417 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
-2.8375 0.1583 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
-0.6083 1.3333 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0
1.3125 -1.0542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.7875 -1.3500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
0.2750 -1.0500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.3542 0.1458 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-1.0375 1.7583 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
-0.0333 1.4875 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 1 0 0 0 0
3 4 1 0 0 0 0
2 5 1 0 0 0 0
6 8 1 0 0 0 0
3 7 1 6 0 0 0
5 8 1 0 0 0 0
2 9 2 0 0 0 0
4 10 2 0 0 0 0
5 11 1 1 0 0 0
6 12 2 0 0 0 0
4 13 1 0 0 0 0
6 14 1 0 0 0 0
14 15 1 0 0 0 0
11 16 1 0 0 0 0
7 17 1 0 0 0 0
7 18 1 0 0 0 0
7 19 1 0 0 0 0
13 20 1 0 0 0 0
16 21 1 0 0 0 0
16 22 1 0 0 0 0
M END
> <mol_id>
MOL16520
$$$$
MOL55310
-Chem04051902312
11 11 0 0 0 0 0 0 0999 V2000
6.7292 -1.5750 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
7.5542 -1.5750 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
6.7250 -2.4000 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
6.7292 -0.7500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.9125 -1.5917 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.9667 -0.8542 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.5167 -2.3292 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.4792 -0.8917 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.6542 -0.9167 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.6917 -2.3417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.2625 -1.6375 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 2 0 0 0 0
1 4 2 0 0 0 0
1 5 1 0 0 0 0
2 6 1 0 0 0 0
5 7 2 0 0 0 0
5 8 1 0 0 0 0
8 9 2 0 0 0 0
7 10 1 0 0 0 0
9 11 1 0 0 0 0
10 11 2 0 0 0 0
M END
> <mol_id>
MOL55310
$$$$
.........
Here is an example of the ID file, file2:
MOL101103
MOL103108
MOL108108
I used awk:
awk 'BEGIN{ORS="$$$$"}NR==FNR{a[$1];next}$1 in a' file2 RS="$" file1.sdf
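But the resulting output is not ordered. I need to extract from file1.sdf the molecules listed in file2, in the same order as in file2, so that the output is an SDF like this: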
MOL101103
-Chem-6789043209
12 12 0 0 0 0 0 0 0999 V2000
5.5667 -2.7625 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
6.3292 -2.7625 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
4.8292 -2.7917 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.6292 -3.7167 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.5542 -2.0042 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.4375 -2.1500 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.4792 -3.4375 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.7667 -3.9167 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
3.7417 -3.4542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.6917 -2.1750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.3500 -2.8292 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
2.5917 -2.8417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 1 0 0 0 0
1 4 2 0 0 0 0
1 5 2 0 0 0 0
3 6 2 0 0 0 0
3 7 1 0 0 0 0
2 8 1 0 0 0 0
7 9 2 0 0 0 0
6 10 1 0 0 0 0
9 11 1 0 0 0 0
11 12 1 0 0 0 0
10 11 2 0 0 0 0
M END
> <mol_id>
MOL101103
$$$$
MOL103108
-Chem-6789005434
14 14 0 0 0 0 0 0 0999 V2000
5.9250 -2.8417 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
2.8875 -2.9292 0.0000 N 0 3 0 0 0 0 0 0 0 0 0 0
6.6917 -2.8417 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.1667 -2.8750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.6542 -2.9125 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.9167 -3.6167 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.9167 -2.0667 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.7042 -3.9042 0.0000 O 0 5 0 0 0 0 0 0 0 0 0 0
2.4042 -2.1500 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.8167 -3.5292 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.7792 -2.2167 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.0167 -2.2417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.0542 -3.5542 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
7.0125 -3.7792 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2 5 1 0 0 0 0
1 3 1 0 0 0 0
1 4 1 0 0 0 0
5 12 2 0 0 0 0
1 6 2 0 0 0 0
1 7 2 0 0 0 0
2 8 1 0 0 0 0
2 9 2 0 0 0 0
4 10 1 0 0 0 0
4 11 2 0 0 0 0
11 12 1 0 0 0 0
10 13 2 0 0 0 0
3 14 1 0 0 0 0
5 13 1 0 0 0 0
M CHG 2 2 1 8 -1
M END
> <mol_id>
MOL103108
$$$$
MOL108108
-Chem-8567890432
12 12 0 0 0 0 0 0 0999 V2000
5.8875 -2.8500 0.0000 S 0 0 0 0 0 0 0 0 0 0 0 0
6.6500 -2.8500 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
5.1542 -2.8750 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
5.8792 -3.7292 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
5.8792 -2.0875 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
4.7542 -2.2292 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.8000 -3.5167 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
3.6667 -2.9125 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
6.9417 -3.8125 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0
2.9125 -2.9292 0.0000 N 0 0 0 0 0 0 0 0 0 0 0 0
4.0167 -2.2625 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
4.0667 -3.5417 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1 0 0 0 0
1 3 1 0 0 0 0
1 4 2 0 0 0 0
1 5 2 0 0 0 0
3 6 2 0 0 0 0
3 7 1 0 0 0 0
8 12 1 0 0 0 0
2 9 1 0 0 0 0
8 10 1 0 0 0 0
6 11 1 0 0 0 0
7 12 2 0 0 0 0
8 11 2 0 0 0 0
M END
> <mol_id>
MOL108108
$$$$
......
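So the first molecule in the output file is the first molecule in the ID file, and so on. Thanks!
Answer 0 (score: 1)
I can't work out your input format or where several of the data items in your output come from, but here is a generic way to print records from file1 in the order of the IDs in file2: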
$ cat tst.awk
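NR==FNR {
    # first file (file2): remember the IDs and the order they appear in
    idSet[$0]
    idOrder[++numIds] = $0
    next
}
$1 in idSet { id = $1 }
# keep only lines of a wanted record whose first field is not purely numeric
id != "" && $1 !~ /^[0-9.]+$/ {
    rec[id] = rec[id] $0 ORS
}
# a "$$$$" line ends the record, so stop collecting until the next wanted ID
$0 == "$$$$" { id = "" }
END {
    for (idNr=1; idNr<=numIds; idNr++) {
        id = idOrder[idNr]
        if (id in rec) {
            printf "%s", rec[id]
        }
    }
}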
$ awk -f tst.awk file2 file1
MOL108108
-Chem-8567890432
M END
> <mol_id>
MOL108108
$$$$
MOL450987
[…]
M END
> <mol_id>
MOL450987
$$$$
Massage to suit.
Answer 1 (score: 1)
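For the SDF layout shown in the question, the same idea can be sketched line by line, without relying on a multi-character RS. This is only a sketch under stated assumptions: the function name extract_ordered is made up for illustration, the first line of each record is taken to be the molecule ID, and each record is taken to end with a line containing exactly $$$$.

```shell
# Hypothetical helper: extract_ordered IDS_FILE SDF_FILE
# Prints the SDF records whose title line (assumed to be the first line
# of each record) is listed in IDS_FILE, in the order given in IDS_FILE.
extract_ordered() {
  awk '
    NR == FNR {                  # first file: remember the requested IDs in order
      if ($1 != "") { order[++n] = $1; want[$1] }
      next
    }
    in_rec == 0 {                # first line of a new record is its title
      id = $1
      in_rec = 1
      buf = ""
    }
    { buf = buf $0 "\n" }        # collect every line, including "$$$$"
    $0 == "$$$$" {               # end of record: keep it only if requested
      if (id in want) rec[id] = buf
      in_rec = 0
    }
    END {
      for (i = 1; i <= n; i++)
        if (order[i] in rec) printf "%s", rec[order[i]]
    }
  ' "$1" "$2"
}
```

Called as extract_ordered file2 file1.sdf > out.sdf, it writes the requested records in the order file2 lists them. IDs that do not occur in the SDF are skipped silently, and only the requested records are kept in memory.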
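Adapting your original awk:
awk 'BEGIN{RS="\\$\\$\\$\\$\n"; ORS="$$$$\n"}
(NR==FNR){a[$1]=$0; next}
($1 in a) { print a[$1] }' file1.sdf RS="\n" file2.txt
The idea is to read the SDF file into memory, record by record.
The record separator is a line of $$$$, that is, $$$$ followed by a newline. In GNU awk you can set it with RS="\\$\\$\\$\\$\n". The $ has to be escaped because it has a special meaning in a regular expression (the end-of-string anchor). Two levels of escaping are involved: awk's lexical analyzer first turns \\$ into \$, and the regular expression engine then reads \$ as a literal $. Including the newline in RS keeps the stored records from starting with a stray newline, which would otherwise glue a record onto the preceding $$$$ when the records are printed in a different order.
The output record separator (the separator written after each printed record) is simply ORS="$$$$\n". No escaping is needed here because it is an ordinary string, not a regular expression.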
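The two levels of escaping described above can be seen in isolation with gsub, which also accepts a regular expression given as a string. A minimal sketch (the function name count_dollars is made up for illustration):

```shell
# Hypothetical helper: count the "$" characters in a string.
count_dollars() {
  printf '%s\n' "$1" | awk '{
    s = $0
    # "\\$" is read by the awk lexer as the string \$ and is then
    # compiled as a regular expression that matches a literal dollar sign
    print gsub("\\$", "", s)
  }'
}
```

For example, count_dollars 'a$b$$' prints 3. With an unescaped "$" the pattern would be the end-of-string anchor rather than a literal dollar sign.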
For the first file (NR==FNR), we store the complete record $0 in an array indexed by its first field, the molecule name: a[$1]=$0.
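The second file uses the ordinary record separator, a newline (RS="\n"). Each time a record is read from it, we check whether it is an index of a and, if so, print the stored record. Because the second file drives the printing, the output follows the order of the IDs in that file.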
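To check that the extracted file really follows the order of the ID file, it can help to list the title line of every record in the output. A small sketch, assuming the first line of each record is the molecule ID and each record ends with a line containing exactly $$$$ (the function name sdf_ids is made up for illustration):

```shell
# Hypothetical helper: print the title line of every record in an SDF.
sdf_ids() {
  awk '
    in_rec == 0 { print $1; in_rec = 1 }   # first line of a record
    $0 == "$$$$" { in_rec = 0 }            # the next line starts a new record
  ' "$1"
}
```

Running sdf_ids on the output and comparing the result with the ID file (minus any IDs that are absent from the SDF) shows whether the order was preserved.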