Question

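I am trying to calculate the distance between every coordinate of the protein atoms (ATOM) and the ligand atoms (HETATM). I have files that look like this: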

ATOM   1592 HD13 LEU D  46      11.698 -10.914   2.183  1.00  0.00           H  
ATOM   1593 HD21 LEU D  46      11.528  -8.800   5.301  1.00  0.00           H  
ATOM   1594 HD22 LEU D  46      12.997  -9.452   4.535  1.00  0.00           H  
ATOM   1595 HD23 LEU D  46      11.722  -8.718   3.534  1.00  0.00           H  
HETATM 1597  N1  308 A   1       0.339   6.314  -9.091  1.00  0.00           N  
HETATM 1598  C10 308 A   1      -0.195   5.226  -8.241  1.00  0.00           C  
HETATM 1599  C7  308 A   1      -0.991   4.254  -9.133  1.00  0.00           C  
HETATM 1600  C1  308 A   1      -1.468   3.053  -8.292  1.00  0.00           C

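So I want to calculate the distance between ATOM 1 and every HETATM (HETATM 1, HETATM 2, and so on), then between ATOM 2 and every HETATM, and so on. I have written a script in Perl, but I can't figure out what is wrong with it: it gives me no error, it just doesn't print anything.

I am also not sure how to add the following to the script, if it is possible: when the result of a calculation is more than 5, delete both lines that were involved in that calculation; when it is <= 5, keep them.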

#!/usr/local/bin/perl 

    use strict;
    use warnings;

    open(IN, $ARGV[0]) or die "$!"; 
    my (@refer, @points);
    my $part = 0;
    my $dist;
    while (my $line = <IN>) { 
        chomp($line);
        if ($line =~ /^HETATM/) {
            $part++;
            next;
        }
        my @array = (substr($line, 30, 8),substr($line,38,8),substr($line,46,8));
    #    print "@array\n";
        if ($part == 0) {
            push @refer, [ @array ]; 
        } elsif ($part ==1){
            push @points, [ @array ]; 
        }
    }

        foreach my $ref(@refer) {
        my ($x1, $y1, $z1) = @{$ref};
        foreach my $atom(@points) {
            my ($x, $y, $z) = @{$atom};
            my $dist = sqrt( ($x-$x1)**2 + ($y-$y1)**2 + ($z-$z1)**2 );
        print $dist;

        }

    }

Answer 1

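When you see a line starting with HETATM, you increment $part and skip to the next input line, so the HETATM coordinates are never stored and your array @points stays empty.

Remove the next; line after the $part increment.

Your test should then be } elsif ( $part ) { ... }, because you increment $part for every HETATM line.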
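A minimal sketch of the read loop with both changes applied, assuming the fixed-column layout from the question (x, y, z at offsets 30, 38 and 46). The helper name read_coords and the check that skips anything that is not an ATOM/HETATM record are my own additions for illustration, not part of the original script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# read_coords is a hypothetical helper name, not part of the original script.
# It takes an open filehandle and returns two array refs:
# ATOM coordinates and HETATM coordinates.
sub read_coords {
    my ($fh) = @_;
    my ( @refer, @points );
    my $part = 0;
    while ( my $line = <$fh> ) {
        chomp $line;
        # Assumption: ignore records that cannot hold x, y, z.
        next unless $line =~ /^(?:ATOM|HETATM)/ && length($line) >= 54;
        $part++ if $line =~ /^HETATM/;    # note: no "next" here any more
        # Fixed columns as in the question; "+ 0" turns the padded
        # field into a plain number.
        my @xyz = map { substr( $line, $_, 8 ) + 0 } 30, 38, 46;
        if    ( $part == 0 ) { push @refer,  [@xyz] }
        elsif ( $part )      { push @points, [@xyz] }
    }
    return ( \@refer, \@points );
}

# Usage: perl script.pl file.pdb
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my ( $refer, $points ) = read_coords($in);
    for my $r (@$refer) {
        for my $p (@$points) {
            my $dist = sqrt( ( $p->[0] - $r->[0] )**2
                           + ( $p->[1] - $r->[1] )**2
                           + ( $p->[2] - $r->[2] )**2 );
            print "$dist\n";
        }
    }
}
```

One limit of keeping the $part counter: any ATOM record that appears after the first HETATM is also pushed onto @points. Testing the record name directly avoids that.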

Answer 2

OK, I have to admit that I rewrote your code to work a bit differently.

Something like this:

#!/usr/bin/env perl
use strict;
use warnings;

use Data::Dumper;

my %coordinates; 
#use types to track different types. Unclear if you need to handle anything other than 'not ATOM' but this is in case you do. 

my %types; 

#read STDIN or files specified on command line - like how grep/sed do it. 
while ( <> ) {
   my ( $type, $id, undef, undef, undef, undef, $x, $y, $z ) = split; # splits on white space. 
   $coordinates{$type}{$id} = [$x, $y, $z];
   $types{$type}++ if $type ne 'ATOM'; 
}

#print for debugging:
print Dumper \%coordinates;
print Dumper \%types;

#iterate each element of "ATOM"
foreach my $atom_id ( keys %{$coordinates{'ATOM'}} ) { 
   #iterate all the types (HETATM)
   foreach my $type ( sort keys %types ) { 
      #iterate each id within the data structure. 
      foreach my $id ( sort keys %{$coordinates{$type}} ) { 

         my $dist = 0;
         #take square of x - x1, y - y1, z - z1
         #do it iteratively, using 'for' loop.
         $dist += (($coordinates{$type}{$id}[$_] - $coordinates{'ATOM'}{$atom_id}[$_]) ** 2) for 0..2; 
         $dist = sqrt $dist; 

         print "$atom_id -> $type $id $dist\n";
      }
   }
}

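This:

Uses <> to read STDIN or the files named on the command line, instead of manually opening $ARGV[0], for a similar result. (It also means you can pipe data through it.)
Reads your data into a hash first.
Then iterates over all the possible pairings, calculating the distance for each.
Prints each pairing with its distance; this is where a condition on the distance can be added (in the sample data every distance is greater than 5).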

This gives the results:

1592 -> HETATM 1597 23.5145474334506
1592 -> HETATM 1598 22.5965224094328
1592 -> HETATM 1599 22.7844420822631
1592 -> HETATM 1600 21.8665559702483
1595 -> HETATM 1597 22.6919443415499
1595 -> HETATM 1598 21.7968036647578
1595 -> HETATM 1599 22.1437585337268
1595 -> HETATM 1600 21.2693868505888
1594 -> HETATM 1597 24.3815421169376
1594 -> HETATM 1598 23.509545380547
1594 -> HETATM 1599 23.8816415683679
1594 -> HETATM 1600 23.0248383056212
1593 -> HETATM 1597 23.6802952050856
1593 -> HETATM 1598 22.74957513889
1593 -> HETATM 1599 23.1402816102138
1593 -> HETATM 1600 22.2296935201545

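Now, you mention wanting to delete the lines that are 'too far apart'. That is a bit more complicated, because you have a compound criterion (and with this sample you would end up deleting all of your lines).

The problem is that you cannot know whether an ATOM line is "too far away" until you have tested every pairing in the file.

You could do it like this: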

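#NOTE: %lines is not populated by the read loop above. Declare "my %lines;"
#next to %coordinates and add "$lines{$type}{$id} = $_;" inside the while loop.
#iterate each element of "ATOM"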
foreach my $atom_id ( keys %{$coordinates{'ATOM'}} ) { 
   #iterate all the types (HETATM)
   foreach my $type ( sort keys %types ) { 
      #iterate each id within the data structure. 
      foreach my $id ( sort keys %{$coordinates{$type}} ) { 

         my $dist = 0;
         #take square of x - x1, y - y1, z - z1
         #do it iteratively, using 'for' loop.
         $dist += (($coordinates{$type}{$id}[$_] - $coordinates{'ATOM'}{$atom_id}[$_]) ** 2) for 0..2; 
         $dist = sqrt $dist; 

         print "### $atom_id -> $type $id $dist\n";

         ##note - this will print out multiple times if there's multiple pairings. 
         if ( $dist <= 5 ) {
            print $lines{'ATOM'}{$atom_id};
            print $lines{$type}{$id};
         }
      }
   }
}

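For each pairing compared, this prints the ATOM and HETATM lines when the distance is <= 5. If a line takes part in several such pairings, though, it is printed several times.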
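To print each line only once, one option is to mark the lines during the pairing loop and print afterwards. The sketch below keeps the whitespace-split parsing used above; the helper name close_lines and the %keep and %order hashes are assumptions made for illustration, and it has only been reasoned through against small inputs shaped like the sample.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# close_lines is a hypothetical helper: given a cutoff and the raw input
# lines, it returns every line that takes part in at least one ATOM/non-ATOM
# pairing with distance <= cutoff, each line once, in input order.
sub close_lines {
    my ( $cutoff, @input ) = @_;
    my ( %coordinates, %lines, %order, %keep );
    my $n = 0;
    for (@input) {
        my ( $type, $id, undef, undef, undef, undef, $x, $y, $z ) = split;
        next unless defined $z;              # skip blank or short lines
        $coordinates{$type}{$id} = [ $x, $y, $z ];
        $lines{$type}{$id}       = $_;       # keep the raw line for output
        $order{$type}{$id}       = $n++;     # remember the input position
    }
    for my $atom_id ( keys %{ $coordinates{ATOM} || {} } ) {
        for my $type ( grep { $_ ne 'ATOM' } keys %coordinates ) {
            for my $id ( keys %{ $coordinates{$type} } ) {
                my $dist = 0;
                $dist += ( $coordinates{$type}{$id}[$_]
                         - $coordinates{ATOM}{$atom_id}[$_] )**2 for 0 .. 2;
                next if sqrt($dist) > $cutoff;
                # Mark both lines; marking a line twice is harmless.
                $keep{ATOM}{$atom_id} = 1;
                $keep{$type}{$id}     = 1;
            }
        }
    }
    my @kept;
    for my $type ( keys %keep ) {
        push @kept, map { [ $order{$type}{$_}, $lines{$type}{$_} ] }
            keys %{ $keep{$type} };
    }
    return map { $_->[1] } sort { $a->[0] <=> $b->[0] } @kept;
}

# Usage: perl script.pl file.pdb > close.pdb
print close_lines( 5, <> ) if @ARGV;
```

Because the marking happens inside the full pairing loop, a line is kept as soon as any one of its pairings is within the cutoff, and dropped only if all of them are too far.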

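But I think your core error is the handling of $part and the next clause.

You only ever increment $part: it is initialised to 0 and never reset to zero, so for each successive HETATM it becomes 1, 2, 3, 4.
You use next after incrementing $part, which means you skip the if ( $part == 1 ) clause entirely.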

Answer 3

I would use this approach:

#!/usr/bin/env perl 

use strict;
use warnings;

use constant POZITION => ( 6, 7, 8 );    # X, Y, Z

sub dist {
    my ( $a, $b ) = @_;
    my $s = 0;
    for my $i ( 0 .. $#$a ) {
        $s += ( $a->[$i] - $b->[$i] )**2;
    }
    return sqrt($s);
}

# Record format
use constant {
    LINE => 0,
    POZ  => 1,
    KEEP => 2,
};

my ( @refer, @points );
while ( my $line = <> ) {
    my ( $type, @poz ) = ( split ' ', $line )[ 0, POZITION ];
    print STDERR join( ',', $type, @poz ), "\n";
    if ( $type eq 'ATOM' ) {
        push @refer, [ $line, \@poz ];
    }
    elsif ( $type eq 'HETATM' ) {
        push @points, [ $line, \@poz ];
    }
}

for my $ref (@refer) {
    for my $atom (@points) {
        my $dist = dist( $ref->[POZ], $atom->[POZ] );
        print STDERR "$ref->[LINE]$atom->[LINE]dist: $dist\n";
        next if $dist > 5;
        $ref->[KEEP]  ||= 1;
        $atom->[KEEP] ||= 1;
    }
}

print $_->[LINE] for grep $_->[KEEP], @refer, @points;

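Unfortunately, your data doesn't contain any ATOM and HETATM pair with a distance <= 5. (Note that split ' ' is a word split: it ignores any leading and trailing whitespace.)

It works as a filter, with the debugging output going to STDERR.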
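A small illustration of the word-split behaviour mentioned above, comparing split ' ' with a split on a single-space regex. The sample string is made up for the demonstration.

```perl
use strict;
use warnings;

# Made-up sample with leading, repeated and trailing whitespace.
my $line = "  ATOM   1592 HD13 \n";

# Special case: a single-space string skips leading whitespace and then
# splits on runs of whitespace (the newline included).
my @words = split ' ', $line;

# A single-space regex splits on every individual space, so leading and
# repeated spaces produce empty fields.
my @raw = split / /, $line;

print scalar(@words), " fields with split ' ': @words\n";
print scalar(@raw),   " fields with split / /\n";
```

Keep in mind that PDB records are fixed-column, so neighbouring fields can run together without a space between them (large serial numbers or wide coordinates, for example). In files like that, substr with fixed offsets, as in the question's script, is likely the safer way to extract the coordinates.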

Calculating the distance between protein and ligand
