Extract molecules from an SDF file in the order of the IDs given in another file

Date: 2019-05-17 03:02:14

Tags: unix awk extract bioinformatics sdf

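I have an SDFile containing thousands of molecules, and I need to extract molecules from it based on the IDs listed in a simple single-column file. An example of the SDF, file1.sdf, would be: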

MOL108108
  -Chem-8567890432

 15 15  0     0  0  0  0  0  0999 V2000
    6.1792   -2.6875    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    6.9542   -2.6875    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.4125   -2.7167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.1667   -3.4667    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    6.1667   -1.9000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.7375   -3.4625    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.1000   -2.7667    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    3.1500   -4.1292    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.0542   -3.3792    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.0167   -2.0542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.8792   -2.7542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.2542   -3.7125    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.2500   -2.0792    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2875   -3.4042    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.9542   -3.4875    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
  1  4  2  0  0  0  0
  1  5  2  0  0  0  0
  6  7  1  0  0  0  0
  7 11  1  0  0  0  0
  6  8  2  0  0  0  0
  3  9  1  0  0  0  0
  3 10  2  0  0  0  0
 11 13  2  0  0  0  0
  2 12  1  0  0  0  0
 10 13  1  0  0  0  0
  9 14  2  0  0  0  0
  6 15  1  0  0  0  0
 11 14  1  0  0  0  0
M  END
> <mol_id>
MOL108108

$$$$
MOL16520
  -Chem4051902312

 22 21  0     1  0  0  0  0  0999 V2000
    0.2750    0.1500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2458   -0.1500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7917   -0.1500    0.0000 C   0  0  2  0  0  0  0  0  0  0  0  0
    1.3167    0.1458    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7625    0.1583    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
   -1.8083    0.1583    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7917   -0.7500    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
   -1.2833   -0.1417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2458   -0.7500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.3167    0.7458    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.7625    0.7583    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.8000    0.7583    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.8292   -0.1542    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -2.3208   -0.1417    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   -2.8375    0.1583    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   -0.6083    1.3333    0.0000 C   0  0  3  0  0  0  0  0  0  0  0  0
    1.3125   -1.0542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.7875   -1.3500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.2750   -1.0500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.3542    0.1458    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.0375    1.7583    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.0333    1.4875    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
  3  4  1  0  0  0  0
  2  5  1  0  0  0  0
  6  8  1  0  0  0  0
  3  7  1  6  0  0  0
  5  8  1  0  0  0  0
  2  9  2  0  0  0  0
  4 10  2  0  0  0  0
  5 11  1  1  0  0  0
  6 12  2  0  0  0  0
  4 13  1  0  0  0  0
  6 14  1  0  0  0  0
 14 15  1  0  0  0  0
 11 16  1  0  0  0  0
  7 17  1  0  0  0  0
  7 18  1  0  0  0  0
  7 19  1  0  0  0  0
 13 20  1  0  0  0  0
 16 21  1  0  0  0  0
 16 22  1  0  0  0  0
M  END
> <mol_id>
MOL16520

$$$$
MOL55310
  -Chem04051902312

 11 11  0     0  0  0  0  0  0999 V2000
    6.7292   -1.5750    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    7.5542   -1.5750    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    6.7250   -2.4000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    6.7292   -0.7500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.9125   -1.5917    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.9667   -0.8542    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.5167   -2.3292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.4792   -0.8917    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.6542   -0.9167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.6917   -2.3417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2625   -1.6375    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  2  0  0  0  0
  1  4  2  0  0  0  0
  1  5  1  0  0  0  0
  2  6  1  0  0  0  0
  5  7  2  0  0  0  0
  5  8  1  0  0  0  0
  8  9  2  0  0  0  0
  7 10  1  0  0  0  0
  9 11  1  0  0  0  0
 10 11  2  0  0  0  0
M  END
> <mol_id>
MOL55310

$$$$

.........

This is an example of the ID file, file2:

MOL101103
MOL103108
MOL108108

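I am using awk: awk 'BEGIN{ORS="$$$$"}NR==FNR{a[$1];next}$1 in a' file2 RS="$" file1.sdf

But the resulting output is not in order. I need to extract from file1.sdf the molecules that are listed in file2, in the same order as in file2, so that the output is an SDF like this: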

MOL101103
  -Chem-6789043209

12 12  0     0  0  0  0  0  0999 V2000
    5.5667   -2.7625    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    6.3292   -2.7625    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    4.8292   -2.7917    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.6292   -3.7167    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.5542   -2.0042    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.4375   -2.1500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.4792   -3.4375    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.7667   -3.9167    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    3.7417   -3.4542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6917   -2.1750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.3500   -2.8292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    2.5917   -2.8417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
  1  4  2  0  0  0  0
  1  5  2  0  0  0  0
  3  6  2  0  0  0  0
  3  7  1  0  0  0  0
  2  8  1  0  0  0  0
  7  9  2  0  0  0  0
  6 10  1  0  0  0  0
  9 11  1  0  0  0  0
 11 12  1  0  0  0  0
 10 11  2  0  0  0  0
M  END
> <mol_id>
MOL101103

$$$$
MOL103108
  -Chem-6789005434

14 14  0     0  0  0  0  0  0999 V2000
    5.9250   -2.8417    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    2.8875   -2.9292    0.0000 N   0  3  0  0  0  0  0  0  0  0  0  0
    6.6917   -2.8417    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.1667   -2.8750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6542   -2.9125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.9167   -3.6167    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.9167   -2.0667    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.7042   -3.9042    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
    2.4042   -2.1500    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.8167   -3.5292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.7792   -2.2167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0167   -2.2417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0542   -3.5542    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.0125   -3.7792    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  5  1  0  0  0  0
  1  3  1  0  0  0  0
  1  4  1  0  0  0  0
  5 12  2  0  0  0  0
  1  6  2  0  0  0  0
  1  7  2  0  0  0  0
  2  8  1  0  0  0  0
  2  9  2  0  0  0  0
  4 10  1  0  0  0  0
  4 11  2  0  0  0  0
 11 12  1  0  0  0  0
 10 13  2  0  0  0  0
  3 14  1  0  0  0  0
  5 13  1  0  0  0  0
M  CHG  2   2   1   8  -1
M  END
> <mol_id>
MOL103108

$$$$
MOL108108
  -Chem-8567890432

12 12  0     0  0  0  0  0  0999 V2000
    5.8875   -2.8500    0.0000 S   0  0  0  0  0  0  0  0  0  0  0  0
    6.6500   -2.8500    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    5.1542   -2.8750    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.8792   -3.7292    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.8792   -2.0875    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    4.7542   -2.2292    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.8000   -3.5167    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.6667   -2.9125    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.9417   -3.8125    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    2.9125   -2.9292    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
    4.0167   -2.2625    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.0667   -3.5417    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  1  0  0  0  0
  1  3  1  0  0  0  0
  1  4  2  0  0  0  0
  1  5  2  0  0  0  0
  3  6  2  0  0  0  0
  3  7  1  0  0  0  0
  8 12  1  0  0  0  0
  2  9  1  0  0  0  0
  8 10  1  0  0  0  0
  6 11  1  0  0  0  0
  7 12  2  0  0  0  0
  8 11  2  0  0  0  0
M  END
> <mol_id>
MOL108108

$$$$

......

So the first molecule in the output file is the first molecule in the ID file, and so on. Thanks!

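2 answers:

Answer 0 (score: 1)

I can't figure out your input format, or where several of the data items in your output come from, but here is a general approach to printing records from file1 in the order of the IDs in file2: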

$ cat tst.awk
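NR==FNR {                       # first file (file2): the ID list
    idSet[$0]                   # set of wanted IDs
    idOrder[++numIds] = $0      # remember the order in which the IDs were listed
    next
}
$1 in idSet { id = $1 }         # a line whose first field is a wanted ID selects that ID
$1 !~ /^[0-9.]+$/ {             # skip lines whose first field is only digits and dots
    rec[id] = rec[id] $0 ORS    # append the line to the record of the current ID
}
END {
    # print the collected records in the order of the ID list
    for (idNr=1; idNr<=numIds; idNr++) {
        id = idOrder[idNr]
        if (id in rec) {
            print rec[id]
        }
    }
}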

$ awk -f tst.awk file2 file1
MOL108108
  -Chem-8567890432

M  END
> <mol_id>
MOL108108

$$$$
MOL450987
[…]
M  END
> <mol_id>
MOL450987

$$$$

Massage to suit.
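As one possible way to "massage" the approach above for SDF input, the sketch below buffers each record from its title line up to the $$$$ terminator and keeps every line, including the atom and bond blocks. It assumes that the ID is the first field of the first line of each record and that $$$$ stands alone on its own line; the demo files and IDs (MOL1, MOL2, MOL9) are made up for illustration.

```shell
# Hypothetical demo data: a two-record SDF and an ID list in a different order.
tmp=$(mktemp -d)
printf '%s\n' 'MOL1' 'M  END' '> <mol_id>' 'MOL1' '' '$$$$' \
              'MOL2' 'M  END' '> <mol_id>' 'MOL2' '' '$$$$' > "$tmp/file1.sdf"
printf '%s\n' 'MOL2' 'MOL9' 'MOL1' > "$tmp/file2"   # MOL9 is not in the SDF

ordered=$(awk '
NR == FNR {                        # first file: the ID list
    if (NF) { order[++n] = $1; want[$1] }
    next
}
{
    if (!started) { id = $1; started = 1 }   # first line of a record is its title
    buf = buf $0 "\n"                         # keep every line of the record
    if ($0 == "$$$$") {                       # end of record
        if (id in want) rec[id] = buf
        buf = ""; started = 0
    }
}
END {
    # emit the stored records in ID-list order; IDs without a record are skipped
    for (i = 1; i <= n; i++)
        if (order[i] in rec) printf "%s", rec[order[i]]
}' "$tmp/file2" "$tmp/file1.sdf")

printf '%s\n' "$ordered"
rm -rf "$tmp"
```

Because this works line by line, it does not rely on a multi-character RS and should run in any POSIX awk. Only the wanted records are held in memory.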

Answer 1 (score: 1)

Adapting your original awk:

awk 'BEGIN{RS="\\$\\$\\$\\$"; ORS="$$$$"}
     (NR==FNR){a[$1]=$0; next}
     ($1 in a) { print a[$1] }' file1.sdf RS="\n" file2.txt

The idea is to read the SDF file into memory, record by record.

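  • The record separator is $$$$. In GNU awk you can set it with RS="\\$\\$\\$\\$". The $ has to be escaped because it has a special meaning in a regular expression (the end-of-string anchor). Two levels of escaping are at work: awk's lexical analyzer first turns \\$ into \$, and the regex engine then reads \$ as a literal $.

  • The output record separator (the separator written whenever a record is printed) is simply ORS="$$$$". No escaping is needed here, because ORS is a plain string.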
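A quick way to check the escaping described above is to split a toy input on the literal $$$$ and print the first field of each non-empty record. This is only a sketch: it assumes an awk that treats a multi-character RS as a regular expression (GNU awk does, as does mawk), and the input letters A and B are placeholders.

```shell
# Split on a literal "$$$$" and show the first field of every non-empty record.
fields=$(printf 'A\n$$$$\nB\n$$$$\n' |
    awk 'BEGIN { RS = "\\$\\$\\$\\$" } NF { printf "[%s]", $1 }')
printf '%s\n' "$fields"
```

The NF guard drops the record that consists only of the newline left after the final $$$$.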

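For the first file (NR==FNR), we store the complete record $0 in an array indexed by its first field, the molecule name (a[$1]=$0).

The second file uses the usual record separator, a newline (RS="\n"). Each time a record is read from it, we check whether its first field is a key of a, and if so, print the stored record.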
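One detail to watch: with RS set to exactly $$$$, the newline that follows each separator stays at the start of the next record, and the output ends each record with $$$$ but no line break. A possible variant, assuming every $$$$ in the SDF is immediately followed by a newline and that awk accepts a regex RS (GNU awk or mawk), folds that newline into both separators. The file names and IDs below are hypothetical demo data.

```shell
# Hypothetical demo data: two minimal records and an ID list in reverse order.
tmp=$(mktemp -d)
printf '%s\n' 'MOL1' 'M  END' '$$$$' 'MOL2' 'M  END' '$$$$' > "$tmp/file1.sdf"
printf '%s\n' 'MOL2' 'MOL1' > "$tmp/file2.txt"

out=$(awk 'BEGIN { RS = "\\$\\$\\$\\$\n"; ORS = "$$$$\n" }   # separators include the newline
     NR == FNR { a[$1] = $0; next }                          # SDF: store records by name
     ($1 in a) { print a[$1] }                               # ID list: print in listed order
    ' "$tmp/file1.sdf" RS="\n" "$tmp/file2.txt")

printf '%s\n' "$out"
rm -rf "$tmp"
```

With these separators each stored record starts at its title line and the output keeps $$$$ on a line of its own, as in the input.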