Find a text sequence and create new files with replacement text

Time: 2014-10-13 17:08:19

Tags: perl replace

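I am trying to find a way to write a script that does the following:

  • Open the input file and detect the first occurrence of a repeated three-letter sequence

  • Edit and replace that three-letter sequence 19 times, giving 19 outputs, each with a different three-letter code taken from the list of 19 possible alternative codes

Essentially, this is a fairly simple find-and-replace problem, which I know how to do. The problem is that I need to loop it, so that after the 19 files have been created from the previous line, the next line with a different three-letter code gets the same substitution.

I am having trouble finding a way for the script to recognize the text sequence, since it could be any one of twenty different things.

If anyone has any ideas on how I could do this, please let me know; I will provide any clarification needed!

Here is a sample of the input file: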

ATOM      1  N   SER A   2      37.396  -5.247  -4.830  1.00 65.06           N  
ATOM      2  CA  SER A   2      37.881  -6.354  -3.929  1.00 64.88           C  
ATOM      3  C   SER A   2      36.918  -7.555  -3.786  1.00 64.14           C  
ATOM      4  O   SER A   2      37.287  -8.576  -3.177  1.00 64.31           O  
ATOM      5  CB  SER A   2      38.251  -5.804  -2.552  1.00 65.31           C  
ATOM      6  OG  SER A   2      37.122  -5.210  -1.918  1.00 66.94           O  
ATOM      7  N   GLU A   3      35.705  -7.438  -4.342  1.00 62.82           N  
ATOM      8  CA  GLU A   3      34.716  -8.539  -4.306  1.00 61.94           C  
ATOM      9  C   GLU A   3      35.126  -9.833  -5.033  1.00 59.71           C  
ATOM     10  O   GLU A   3      34.927 -10.911  -4.473  1.00 59.23           O  
ATOM     11  CB  GLU A   3      33.328  -8.094  -4.789  1.00 62.49           C  
ATOM     12  CG  GLU A   3      32.291  -7.994  -3.693  1.00 66.67           C  
ATOM     13  CD  GLU A   3      31.552  -9.302  -3.426  1.00 71.93           C  
ATOM     14  OE1 GLU A   3      32.177 -10.254  -2.892  1.00 73.96           O  
ATOM     15  OE2 GLU A   3      30.329  -9.364  -3.723  1.00 74.25           O  
ATOM     16  N   PRO A   4      35.663  -9.732  -6.280  1.00 57.83           N  
ATOM     17  CA  PRO A   4      36.131 -10.951  -6.967  1.00 56.64           C  

The output would look like this:

ATOM      1  N   ALA A   2      37.396  -5.247  -4.830  1.00 65.06           N  
ATOM      2  CA  SER A   2      37.881  -6.354  -3.929  1.00 64.88           C  
ATOM      3  C   SER A   2      36.918  -7.555  -3.786  1.00 64.14           C  
ATOM      4  O   SER A   2      37.287  -8.576  -3.177  1.00 64.31           O  
ATOM      5  CB  SER A   2      38.251  -5.804  -2.552  1.00 65.31           C  
ATOM      6  OG  SER A   2      37.122  -5.210  -1.918  1.00 66.94           O  
ATOM      7  N   GLU A   3      35.705  -7.438  -4.342  1.00 62.82           N  
ATOM      8  CA  GLU A   3      34.716  -8.539  -4.306  1.00 61.94           C  
ATOM      9  C   GLU A   3      35.126  -9.833  -5.033  1.00 59.71           C  
ATOM     10  O   GLU A   3      34.927 -10.911  -4.473  1.00 59.23           O  
ATOM     11  CB  GLU A   3      33.328  -8.094  -4.789  1.00 62.49           C          
ATOM     12  CG  GLU A   3      32.291  -7.994  -3.693  1.00 66.67           C  
ATOM     13  CD  GLU A   3      31.552  -9.302  -3.426  1.00 71.93           C  
ATOM     14  OE1 GLU A   3      32.177 -10.254  -2.892  1.00 73.96           O  
ATOM     15  OE2 GLU A   3      30.329  -9.364  -3.723  1.00 74.25           O  
ATOM     16  N   PRO A   4      35.663  -9.732  -6.280  1.00 57.83           N  
ATOM     17  CA  PRO A   4      36.131 -10.951  -6.967  1.00 56.64           C  

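In the first pass, SER should be changed to a series of twenty different text sequences, the first being ALA. The problem I am having is that I am not sure how to write a script that changes more than one line of text.

My current script can produce the 19 mutations of the first SER, but that is where it stops. It does not move on to the next residue, and it does not change a different three-letter code; for example, it will not change GLU. Is there a simple way to integrate this functionality?

At the moment my approach is a simple text transformation with sed, but since this seems more complex than what sed is suited for, I think Perl may be the way to go. I can add the sed code, but I do not think it would help much.

2 Answers:

Answer 0 (score: 1)

Your question and comments are not entirely clear, but I believe this script does what you need. It parses the PDB file until it reaches the amino acid of interest, then produces a set of 19 files in which that AA is replaced by each of the other 19 AAs. From then on, whenever the AA differs from the AA on the previous line, another set of 19 files is generated.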

#!/usr/bin/perl
use warnings;
use strict;

# we're going to start mutating when we find this residue.
my $target = 'GLU';

my @aas = ( 'ALA', 'ARG', 'ASN', 'ASP', 'CYS', 'GLU', 'GLN', 'GLY', 'HIS', 'ILE', 'LEU', 'LYS', 'MET', 'PHE', 'PRO', 'SER', 'THR', 'TRP', 'TYR', 'VAL' );

my $prev = '';
my $line_no = 0;
my @lines;
my %changes;

# uncomment the following lines and comment out "while (<DATA>) {"
# to read the input from a file

# my $input = 'path/to/pdb_file';
# open( my $fh, "<", $input ) or die "Could not open $input: $!";
# while (<$fh>) {
while (<DATA>) {
    # split the line into columns (assuming it is tab-delimited;
    # switch this for /\s+/ if it is separated with whitespace).
    my @cols = split "\t";

    if ($target && $cols[3] eq $target) {
        # Found our target residue! unset $target so that the following
        # set of tests are performed
        undef $target;
    }

    # see if this AA is the same as the AA in the previous line
    if (! $target && $prev ne $cols[3]) {
        # if it isn't, store the line number and the amino acid
        $changes{ $line_no } = $cols[3];
        # update $prev to reflect the new AA
        $prev = $cols[3];
    }
    # store all the lines
    push @lines, $_;
    # increment the line number
    $line_no++;
}

# now, for each of the changes, create substitute files
for (keys %changes) {
    create_substitutes($_, $changes{$_}, [@aas], [@lines]);
}

sub create_substitutes {
    # arguments: line no, $res: residue, $aas: array of amino acids,
    # $all_lines: all lines in the file
    my ($line_no, $res, $aas, $all_lines) = @_;

    # this is the target line that we want to substitute
    my @target = split "\t", $all_lines->[$line_no];

    # for each AA in the list of AAs, create a new file called 'XXX-##.txt',
    # where XXX is the amino acid and ## is the line number where the
    # substituted residue is.
    for (@$aas) {
        next if $_ eq $res;
        open( my $fh, ">", $_."-$line_no.txt") or die "Could not create output file for $_: $!";
        # print out all lines up to the changed line
        print { $fh } @$all_lines[0..$line_no-1];
        # print out the changed line, substituting in the AA
        print { $fh } join "\t", @target[0..2], $_, @target[4..$#target];
        # print out the rest of the lines.
        print { $fh } @$all_lines[$line_no+1 .. $#{$all_lines}];
    }
}


__DATA__
ATOM    1   N   SER A   2   37.396  -5.247  -4.830  1.00    65.06   N
ATOM    2   CA  SER A   2   37.881  -6.354  -3.929  1.00    64.88   C
ATOM    3   C   SER A   2   36.918  -7.555  -3.786  1.00    64.14   C
ATOM    4   O   SER A   2   37.287  -8.576  -3.177  1.00    64.31   O
ATOM    5   CB  SER A   2   38.251  -5.804  -2.552  1.00    65.31   C
ATOM    6   OG  SER A   2   37.122  -5.210  -1.918  1.00    66.94   O
ATOM    7   N   GLU A   3   35.705  -7.438  -4.342  1.00    62.82   N
ATOM    8   CA  GLU A   3   34.716  -8.539  -4.306  1.00    61.94   C
ATOM    9   C   GLU A   3   35.126  -9.833  -5.033  1.00    59.71   C
ATOM    10  O   GLU A   3   34.927  -10.911 -4.473  1.00    59.23   O
ATOM    11  CB  GLU A   3   33.328  -8.094  -4.789  1.00    62.49   C
ATOM    12  CG  GLU A   3   32.291  -7.994  -3.693  1.00    66.67   C
ATOM    13  CD  GLU A   3   31.552  -9.302  -3.426  1.00    71.93   C
ATOM    14  OE1 GLU A   3   32.177  -10.254 -2.892  1.00    73.96   O
ATOM    15  OE2 GLU A   3   30.329  -9.364  -3.723  1.00    74.25   O
ATOM    16  N   PRO A   4   35.663  -9.732  -6.280  1.00    57.83   N
ATOM    17  CA  PRO A   4   36.131  -10.951 -6.967  1.00    56.64   C
ATOM    18  CA  ARG A   4   36.131  -10.951 -6.967  1.00    56.64   C

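This sample data will generate a set of files for the first GLU found (line 6, counting from 0), then another set for line 15 (the PRO residue), and another for line 17 (the ARG residue).

Example of the ALA-6.txt file: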

ATOM    1   N   SER A   2   37.396  -5.247  -4.830  1.00    65.06   N
ATOM    2   CA  SER A   2   37.881  -6.354  -3.929  1.00    64.88   C
ATOM    3   C   SER A   2   36.918  -7.555  -3.786  1.00    64.14   C
ATOM    4   O   SER A   2   37.287  -8.576  -3.177  1.00    64.31   O
ATOM    5   CB  SER A   2   38.251  -5.804  -2.552  1.00    65.31   C
ATOM    6   OG  SER A   2   37.122  -5.210  -1.918  1.00    66.94   O
ATOM    7   N   ALA A   3   35.705  -7.438  -4.342  1.00    62.82   N
ATOM    8   CA  GLU A   3   34.716  -8.539  -4.306  1.00    61.94   C
ATOM    9   C   GLU A   3   35.126  -9.833  -5.033  1.00    59.71   C

(etc.)
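The script above splits each line on tabs, whereas the sample in the question looks like a standard fixed-column PDB file, where the residue name sits in columns 18-20 of an ATOM record. A minimal sketch, assuming that fixed-column layout, of reading and replacing the residue name with `substr` so the spacing of the rest of the line is left untouched. The helper names `residue_of` and `set_residue` are hypothetical and not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Residue name of a fixed-column PDB ATOM record.
# Assumption: the name occupies columns 18-20 (1-based), i.e. offset 17.
sub residue_of {
    my ($line) = @_;
    return substr($line, 17, 3);
}

# Return a copy of $line with the residue name replaced by $new.
# Only the three residue-name columns change; the caller's line is not modified.
sub set_residue {
    my ($line, $new) = @_;
    die "residue code must be 3 characters: '$new'" unless length($new) == 3;
    substr($line, 17, 3, $new);    # 4-argument substr replaces in place
    return $line;
}

my $atom = "ATOM      7  N   GLU A   3      35.705  -7.438  -4.342  1.00 62.82           N  \n";
print residue_of($atom), "\n";
print set_residue($atom, 'ALA');
```

Because the replacement has the same width as the original code, the coordinate columns stay aligned, which a split and join on whitespace would not guarantee.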

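If this is not the right behavior, you will have to edit your question, because it is not very clear!

Answer 1 (score: 1)

Because your question is not very clear (more precisely, it is completely unclear), I created the following: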

#!/usr/bin/env perl

use 5.014;
use strict;
use warnings;
use Path::Tiny;
use Bio::PDB::Structure;
use Data::Dumper;

my $residues_file = "input2.txt";   #residue names, one per line
my $molfile = "m1.pdb";             #molecule file

#read the residues
my(@residues) = path($residues_file)->lines({chomp => 1});

my $m= Bio::PDB::Structure::Molecule->new;

for my $res (@residues) {       #for each residue name from the file "input2.txt"
    $m->read($molfile);         #read the molecule
    my $atom = $m->atom(0);     #get the 1st atom
    $atom->residue_name($res);  #change the residue to the one from the file

    #create output filename
    my $outfile = path($molfile)->basename('.pdb') . '_' . lc($res) . '.pdb';
    #write the result
    $m->print($outfile);
}

For example, if input2.txt contains

ALA
ARG
ASN
ASP
CYS
GLN
GLU
GLY
HIS
ILE
LEU
LYS
MET
PHE
PRO
SER
THR
TRP
TYR
VAL

then from your input it produces 20 files in which the residue of the 1st atom is changed (as in your output example), so that:

==> m1_ala.pdb <==
ATOM      1  N   ALA A   2      37.396  -5.247  -4.830  1.00 65.06

==> m1_arg.pdb <==
ATOM      1  N   ARG A   2      37.396  -5.247  -4.830  1.00 65.06

==> m1_asn.pdb <==
ATOM      1  N   ASN A   2      37.396  -5.247  -4.830  1.00 65.06

==> m1_asp.pdb <==
ATOM      1  N   ASP A   2      37.396  -5.247  -4.830  1.00 65.06

==> m1_cys.pdb <==
ATOM      1  N   CYS A   2      37.396  -5.247  -4.830  1.00 65.06

... and so on, 20 times ...
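The example above changes only the first atom. To repeat the substitution for every following residue, as the question asks, one option is to locate the first ATOM line of each residue by chain ID and residue number rather than by name, so that two neighbouring residues with the same name are still treated separately. A minimal pure-Perl sketch, assuming fixed-column PDB records (residue name in columns 18-20, chain ID in column 22, residue number in columns 23-26). The helpers `first_lines_of_residues` and `variants` and the `XXX-N.txt` naming are hypothetical choices for illustration, not something either answer prescribes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The 20 standard residue codes.
my @AAS = qw(ALA ARG ASN ASP CYS GLN GLU GLY HIS ILE
             LEU LYS MET PHE PRO SER THR TRP TYR VAL);

# Return the 0-based indexes of the first ATOM line of every residue.
# A residue is identified by chain ID plus residue number, not by its name.
sub first_lines_of_residues {
    my (@lines) = @_;
    my @first;
    my $prev = '';
    for my $i (0 .. $#lines) {
        next unless $lines[$i] =~ /^ATOM/;
        next if length($lines[$i]) < 26;        # too short to hold the fields
        my $key = substr($lines[$i], 21, 5);    # chain ID + residue number
        push @first, $i if $key ne $prev;
        $prev = $key;
    }
    return @first;
}

# Build every single-site variant.
# Returns a hash of file name => complete file content.
sub variants {
    my (@lines) = @_;
    my %out;
    for my $i (first_lines_of_residues(@lines)) {
        my $old = substr($lines[$i], 17, 3);
        for my $aa (grep { $_ ne $old } @AAS) {
            my @copy = @lines;
            substr($copy[$i], 17, 3, $aa);      # swap in the new residue code
            $out{"$aa-$i.txt"} = join '', @copy;
        }
    }
    return %out;
}

# Usage: perl mutate.pl input.pdb   (writes the files to the current directory)
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Could not open $ARGV[0]: $!";
    my %files = variants(<$in>);
    close $in;
    for my $name (sort keys %files) {
        open my $out, '>', $name or die "Could not create $name: $!";
        print {$out} $files{$name};
        close $out;
    }
}
```

This sketch keeps every variant in memory before writing, which is fine for small inputs; for a large structure it would be better to write each file as soon as it is built.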