Print whole lines that meet a specific condition to separate files

Date: 2018-01-31 03:04:42

Tags: perl

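I am trying to count how many times the same element is repeated in an array, and if it is repeated more than once, I want to print the whole line to a different file.

My input: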

ATM 4387 FE   HEM A 142       
ATM 4388  CHA HEM A 142      
ATM 4389  CHB HEM A 142      
ATM 4431  CHA HEM B 147     
ATM 4432  CHB HEM B 147     
ATM 4433  CHC HEM B 147     
ATM 4434  CHD HEM B 147     
ATM 4559  O   HOH A 156     
ATM 4560  O   HOH A 159

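So I put elements [3], [4] and [5] into a separate array, count how many times each one appears, and set a condition: if it appears more than once, print those lines to a separate file. The other part of the script matches the elements of the @lig array (read from the ligands.txt file) against the elements of the @ligands_pdb array. If there is a match, the element from @ligands_pdb should also be included in the file name.

My @lig array looks like this: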

HEC
HEM
HEP
IGP
IPM
LLP

Since HEM matches, it should also be included in the file name. The error I currently get is: Use of uninitialized value $ligands_pdb in concatenation (.) or string at example.pl line 58, <$_[...]> line 5436.

  #! usr/bin/env perl

use strict;
use warnings;
use autodie;
use 5.010;
use Data::Dumper;

my $data;
my $ligands_pdb;
my @ligands_pdb;

my $ligand_file = 'ligands.txt';
open (LIG, $ligand_file)or die "Cannot open $ligand_file, $!";
my @lig= <LIG>;
close LIG;
#print "@lig\n";
my $flag = 0;
for my $pdb ( glob '*pdb' ) 
{
    #printf "# %s\n", $pdb;
    open my $fh, "<", $pdb;
    for my $line ( <$fh> ) 
{
        chomp( $line );
    if ( $line =~ m/^ATM / ) 
    {
        my @cols = split ' ', $line;
        #print @cols;
        #print "$cols[3]\n";
        push @ligands_pdb, $cols[3];
        my ($chain_id, $res_no) = ( $cols[4], $cols[5] );
            defined $res_no
            or die "Unable to grok line: $line";
            push @{ $data->{$chain_id}->{$res_no} }, $line;
     }

    foreach (@ligands_pdb)
    {
        if ("@lig" =~/$_/ )
        {
            $flag = 0;
        }
        else
        {
            $flag = 1;
        }
     for my $chain_id ( keys %$data ) 
    {
        for my $res_no ( keys %{ $data->{$chain_id} } ) 
        {
        #print "$chain_id\n";
        #print "$res_no\n";
        my @lines = @{ $data->{$chain_id}->{$res_no} };
                if ( $flag ==0 and scalar @lines > 1 ) 
        {
                    open my $out, ">> $ligands_pdb . '#' . $chain_id . '#' . $res_no . '.txt';";    #line 58
                    print $out $_ for (@lines);
                    close $out;
        }
        @ligands_pdb = ();
        }
    }
    }
}
}

I want 2 files to be created, with the following contents:

HEM#A#142:

ATM 4387 FE   HEM A 142       
ATM 4388  CHA HEM A 142      
ATM 4389  CHB HEM A 142

HEM#B#147:

ATM 4431  CHA HEM B 147     
ATM 4432  CHB HEM B 147     
ATM 4433  CHC HEM B 147     
ATM 4434  CHD HEM B 147

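1 answer:

Answer 0 (score: 1)

I would rewrite your code to use a nested hash that stores the file's lines, keyed on 2 fields (chain id and residue number). If more than one line is stored under a key, those lines are saved to a new file. I added some debug output so you can see the flow.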
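
The grouping idea on its own can be sketched as follows. This is a minimal illustration rather than the full script: the sample lines are copied from the question, and `%groups` is a hypothetical name used only here.

```perl
use strict;
use warnings;

# Sample lines copied from the question's input.
my @lines = (
    'ATM 4387 FE   HEM A 142',
    'ATM 4388  CHA HEM A 142',
    'ATM 4431  CHA HEM B 147',
    'ATM 4432  CHB HEM B 147',
    'ATM 4559  O   HOH A 156',
);

# %groups (hypothetical name): chain id -> residue number -> [ lines ].
my %groups;
for my $line (@lines) {
    # Columns 4 and 5 (zero-based) hold the chain id and residue number.
    my ( $chain_id, $res_no ) = ( split ' ', $line )[ 4, 5 ];

    # Autovivification creates the inner hash and the array on first use.
    push @{ $groups{$chain_id}{$res_no} }, $line;
}

# Report only the groups that hold more than one line.
for my $chain_id ( sort keys %groups ) {
    for my $res_no ( sort { $a <=> $b } keys %{ $groups{$chain_id} } ) {
        my @group = @{ $groups{$chain_id}{$res_no} };
        next unless @group > 1;
        printf "%s %s: %d lines\n", $chain_id, $res_no, scalar @group;
    }
}
```

With these five lines, only the A/142 and B/147 groups pass the "more than one line" check; the single HOH line under A/156 is skipped.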

filter.pl:

#!/usr/bin/env perl

use strict;
use warnings;
use autodie;
use 5.010;

use List::Util qw( uniq );

my $DEBUG = 1;
sub debug {
    my ($msg) = @_;
    print "DEBUG: $msg\n" if $DEBUG;
}

my $ligand_file = 'ligands.txt';
open( my $LIG, '<', $ligand_file ) or die "Cannot open $ligand_file, $!";
debug("Reading ligands file: $ligand_file");
my %ligands_hash;
for my $ligand ( <$LIG> ) {
    chomp( $ligand ); # Remove trailing newline
    $ligands_hash{ $ligand } = 1;
}
close $LIG;
debug("Found ligands: " . join(',',sort keys %ligands_hash));

my %output_files;
my $flag = 0;
for my $pdb ( glob '*pdb' ) {

    my %ligands_found;
    my $data_hash_ref;

    debug("-"x40);
    debug("Working on file $pdb");
    open my $fh, "<", $pdb;
    for my $line (<$fh>) {
        chomp($line);
        if ( $line =~ m/^ATM / ) {
            $line =~ s|\s*$||;
            debug("--> Found an ATM line");
            my @cols = split ' ', $line;
            my ( $ligand, $chain_id, $res_no ) = ( $cols[3], $cols[4], $cols[5] );
            debug("--> Adding ligand $ligand to ligands_found hash");
            $ligands_found{ $ligand }++;

            defined $res_no
              or die "Unable to grok line: $line";

            # This works because perl automatically creates the missing
            # parts of a nested hash (this is known as autovivification).
            # The last part, the array, is also created by the attempt
            # to push onto it, because perl assumes it should exist.
            push @{ $data_hash_ref->{$chain_id}->{$res_no} }, $line;
        }
    }

    debug("Processing ligands");
    for my $ligand (sort keys %ligands_found) {
        $flag = defined $ligands_hash{$ligand} ? 0 : 1;
        debug("--> Ligand $ligand, flag = $flag");

        for my $chain_id ( keys %$data_hash_ref ) {
            for my $res_no ( keys %{ $data_hash_ref->{$chain_id} } ) {
                debug("------> Chain Id = $chain_id, Res No = $res_no");
                my @lines = @{ $data_hash_ref->{$chain_id}->{$res_no} };
                if ( $flag == 0 and scalar @lines > 1 ) {

                    # Output filename based on first ligand with $chain_id and $res_no combo
                    my $id = join ':', $chain_id, $res_no;
                    my $outfile = $output_files{$id} ||= join( '#', $ligand, $chain_id, $res_no ) . '.txt';
                    my $nl = (scalar @lines);
                    my $nl_desc = "$nl line" . ($nl > 1 ? "s" : "");
                    debug("------> Appending $nl_desc to $outfile");
                    open my $out, '>>', $outfile;
                    print $out "$_\n" for (uniq @lines);
                    close $out;

                    # Remove the lines so they don't get printed twice.
                    undef @{ $data_hash_ref->{$chain_id}->{$res_no} };
                }
            }
        }
    }
}
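
As for the warning itself: the question's script declares both a scalar `$ligands_pdb` and an array `@ligands_pdb`, which are two unrelated variables in Perl. Only the array is ever filled, so interpolating `$ligands_pdb` into the file name on line 58 produces "Use of uninitialized value". That line also places the `.` operators and the inner quotes inside one double-quoted string, so they are taken literally instead of being evaluated. A minimal sketch of one way to build the name, assuming a hypothetical helper `output_name` and a temporary directory so that nothing is left behind:

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );

# Hypothetical helper: joins the three fields with '#' and appends '.txt'.
sub output_name {
    my ( $ligand, $chain_id, $res_no ) = @_;
    return join( '#', $ligand, $chain_id, $res_no ) . '.txt';
}

my $ligands_pdb;                # scalar: declared but never assigned
my @ligands_pdb = ('HEM');      # array: a different variable, despite the name

# $ligands_pdb[0] is the first element of the array, not the scalar above.
my $outfile = output_name( $ligands_pdb[0], 'A', 142 );
print "$outfile\n";

# Write into a temporary directory for this illustration only.
my $dir  = tempdir( CLEANUP => 1 );
my $path = "$dir/$outfile";

# Three-argument open keeps the mode separate from the file name.
open( my $out, '>>', $path ) or die "Cannot open $path: $!";
print {$out} "ATM 4387 FE   HEM A 142\n";
close($out) or die "Cannot close $path: $!";
```

Here the name comes out as HEM#A#142.txt. The script above gets the same effect with `join( '#', $ligand, $chain_id, $res_no ) . '.txt'`.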

input.pdb:

ATM 4387 FE   HEM A 142       
ATM 4388  CHA HEM A 142      
ATM 4389  CHB HEM A 142      
ATM 4431  CHA HEM B 147     
ATM 4432  CHB HEM B 147     
ATM 4433  CHC IGP B 147     
ATM 4434  CHD IGP B 147     
ATM 4559  O   HOH A 156     
ATM 4560  O   HOH A 159

HEM#A#142.txt:

ATM 4387 FE   HEM A 142
ATM 4388  CHA HEM A 142
ATM 4389  CHB HEM A 142

HEM#B#147.txt:

ATM 4431  CHA HEM B 147
ATM 4432  CHB HEM B 147
ATM 4433  CHC IGP B 147
ATM 4434  CHD IGP B 147