How to extract substrings given their coordinates

Time: 2013-08-29 15:37:44

Tags: perl

I'm very sorry to keep bothering you with my questions, but I need to solve this...

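I want to extract several substrings from a file containing one long string, using a second file that holds the start and end of each substring I want to extract. The first file is: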

>scaffold30     24194
CTTAGCAGCAGCAGCAGCAGTGACTGAAGGAACTGAGAAAAAGAGCGAGCTGAAAGGAAGCATAGCCATTTGGGAGTGCCAGAGAGTTGGGAGG GAGGGAGGGCAGAGATGGAAGAAGAAAGGCAGAAATACAGGGAGATTGAGGATCACCAGGGAG.........
.................

(the string must be everything in the file except the first line), and the coordinates file looks like this:

44801988    44802104
44846151    44846312
45620133    45620274
45640443    45640543
45688249    45688358
45729531    45729658
45843362    45843490
46066894    46066996
46176337    46176464
.....................

My script looks like this:

my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];

#finds  subsequences: fasta files



open INFILE1, $chrom or die "Could not open $chrom: $!";
my $count = 0;

while(<INFILE1>) {
    if ($_ !~ m/^>/) {

    local $/ = undef;
    my $var = <INFILE1>;

    open INFILE, $coords_file or die "Could not open $coords_file: $!";
           my @cline = <INFILE>;
    foreach my $cline (@cline) {
    print "$cline\n";
            my@data = split('\t', $cline);
            my $start = $data[0];
            my $end = $data[1];
            my $offset = $end - $start;
           $count++;
           my $sub = substr ($var, $start, $offset);
           print ">conserved $count\n";
           print "$sub\n";

    }
    close INFILE;
    }
}

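When I run it, it seems to do only one iteration, and it prints the beginning of the first file. It looks like the foreach loop isn't working, and substr doesn't seem to work either. When I added a print of $cline to check the loop, it printed every line of the coordinates file.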

I'm sorry if I'm being annoying, but I have to get this done and I'm a bit desperate...

Thanks again.

2 Answers:

Answer 0 (score: 2)

This line

local $/ = undef;

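changes $/ for the entire enclosing block, which includes the part where you read the second file. $/ is the input record separator, which essentially defines what a "line" is (a newline by default; see perldoc perlvar for details). When you read from a filehandle with <>, $/ determines where reading stops. For example, the following program relies on the default line-by-line behavior, so it reads only up to the first newline: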

use feature 'say';   # say is not enabled by default

my $foo = <DATA>;
say $foo;
# Output:
# 1

__DATA__
1
2
3

whereas this program reads all the way to EOF:

use feature 'say';   # say is not enabled by default

local $/;
my $foo = <DATA>;
say $foo;
# Output:
# 1
# 2
# 3

__DATA__
1
2
3

This means your @cline array gets only one element: a single string containing the entire text of the coordinates file. You can see this with Data::Dumper:

use Data::Dumper;

print Dumper(\@cline);

which in your case outputs something like this:

$VAR1 = [
          '44801988    44802104
44846151    44846312
45620133    45620274
45640443    45640543
45688249    45688358
45729531    45729658
45843362    45843490
46066894    46066996
46176337    46176464
'
        ];

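Note that the array delimited by [] (technically an arrayref here) contains only one element: a single string (enclosed in single quotes) with embedded newlines.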

Let's walk through the relevant part of your code:

while(<INFILE1>) {
    if ($_ !~ m/^>/) {
        # Enable localized slurp mode. Stays in effect until we leave the 'if'
        local $/ = undef;

        # Read the rest of INFILE1 into $var (from current line to EOF)
        my $var = <INFILE1>;

        open INFILE, $coords_file or die "Could not open $coords_file: $!";

        # In list context, return each block until the $/ character as a
        # separate list element. Since $/ is still undef, this will read
        # everything until EOF into our first list element, resulting in
        # a one-element array
        my @cline = <INFILE>;

        # Since @cline only has one element, the loop only has one iteration
        foreach my $cline (@cline) {
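
If you do want to slurp, one way to stop the setting from leaking into the second read is to confine local $/ to the smallest possible scope. The following is a minimal sketch, not your exact script: slurp_rest is a hypothetical helper name, and in-memory filehandles with made-up contents stand in for your two files.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything left on a filehandle.
# `local $/` applies only inside this sub and is restored on return.
sub slurp_rest {
    my ($fh) = @_;
    local $/;                 # undef => slurp mode, scoped to this sub
    my $content = <$fh>;
    return $content;
}

# Made-up data; in-memory filehandles stand in for the real files.
my $fasta_text  = ">scaffold30\nCTTAGC\nAGCAGC\n";
my $coords_text = "1\t3\n4\t6\n";

open my $fasta_fh, '<', \$fasta_text or die "Cannot open in-memory FASTA: $!";
my $header = <$fasta_fh>;            # default $/: reads only the header line
my $rest   = slurp_rest($fasta_fh);  # everything after the header

open my $coords_fh, '<', \$coords_text or die "Cannot open in-memory coords: $!";
my @cline = <$coords_fh>;            # $/ is "\n" again: one element per line

print scalar(@cline), " coordinate lines\n";
```

Because the slurp happens inside the sub, the list-context read of the coordinates afterwards is split on newlines as usual.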

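As a side note, your code could be cleaned up a bit. The names you chose for your filehandles leave something to be desired, and you should use lexical filehandles (and the three-argument form of open):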

open my $chromosome_fh,  "<", $ARGV[0] or die $!;
open my $coordinates_fh, "<", $ARGV[1] or die $!;

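Also, you don't need nested loops here; they only make your code more complicated. First read the relevant part of the chromosome file into a variable (with a more meaningful name than var):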

# Get rid of the `local $/` statement, we don't need it
my $chromosome;
while (<$chromosome_fh>) {
    next if /^>/;
    $chromosome .= $_;
}

Then read in your coordinates file:

my @cline = <$coordinates_fh>;

Or, if you only need to go through the contents of the coordinates file once, process each line with a while loop:

while (<$coordinates_fh>) {
    # Do something for each line here
}
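
As a rough illustration of what the body of that loop might do, here is a sketch that turns each line into a start/end pair. parse_coords is a hypothetical helper, and the sample text is made up from the coordinates in the question. It assumes the columns are separated by any whitespace, which is why it uses split ' ' rather than splitting on a tab.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one line of the coordinates file into a
# (start, end) pair, or return an empty list for lines to skip.
sub parse_coords {
    my ($line) = @_;
    chomp $line;
    my ($start, $end) = split ' ', $line;   # splits on any run of whitespace
    return unless defined $end;             # blank or incomplete line
    return unless $start =~ /^\d+$/ && $end =~ /^\d+$/;   # not numeric
    return ($start, $end);
}

# Made-up sample; an in-memory filehandle stands in for the real file.
my $coords_text = "44801988    44802104\n44846151\t44846312\n\n";
open my $coordinates_fh, '<', \$coords_text or die "Cannot open in-memory coords: $!";

while (my $line = <$coordinates_fh>) {
    my ($start, $end) = parse_coords($line) or next;   # skip unusable lines
    print "start=$start end=$end\n";
}
```

Skipping lines that do not hold two integers keeps blank lines or filler rows from reaching substr.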

Answer 1 (score: 1)

As 'ThisSuitIsBlackNot' suggested, your code could be cleaned up a bit. Here is a solution that may be what you want.

#!/usr/bin/perl
use strict;
use warnings;

my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];

#finds  subsequences: fasta files

open INFILE1, $chrom or die "Could not open $chrom: $!";
my $fasta;

<INFILE1>; # get rid of the first line - '>scaffold30     24194'

while(<INFILE1>) {
    chomp;
    $fasta .= $_;
}
close INFILE1 or die "Could not close '$chrom'. $!";

open INFILE, $coords_file or die "Could not open $coords_file: $!";
my $count = 0;

while(<INFILE>) {
    my ($start, $end) = split;

    # Or, should this be: my $offset = $end - ($start - 1);
    # That would include the start fasta
    my $offset = $end - $start;

    $count++;
    my $sub = substr ($fasta, $start, $offset);
    print ">conserved $count\n";
    print "$sub\n";
}
close INFILE or die "Could not close '$coords_file'. $!";
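
On the question raised in the comment about $offset: substr counts positions from 0, so the answer depends on how the coordinates are defined, which the question does not say. The sketch below assumes they are 1-based and inclusive (a common convention, but only an assumption here). extract_region is a hypothetical helper and the sequence is made up.

```perl
use strict;
use warnings;

# Hypothetical helper. ASSUMPTION: $start and $end are 1-based and inclusive.
# Returns undef (or an empty list) when the region does not fit the sequence.
sub extract_region {
    my ($seq, $start, $end) = @_;
    return if $start < 1 || $end < $start || $end > length($seq);
    # substr is 0-based, so shift the start by one; the length includes both ends.
    return substr($seq, $start - 1, $end - $start + 1);
}

my $seq = "CTTAGCAGCAGC";                 # made-up 12-base sequence

print extract_region($seq, 1, 3), "\n";   # first three bases
print extract_region($seq, 4, 6), "\n";   # next three bases

# Coordinates far beyond the end of the sequence are rejected instead of
# being passed to substr.
my $missing = extract_region($seq, 44801988, 44802104);
print defined($missing) ? $missing : "out of range", "\n";
```

The range check matters because substr returns undef and warns when the start lies past the end of the string. If the 24194 in the question's header line is the sequence length (an assumption), the sample coordinates near 44 million would all fall outside it.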