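I'm very sorry to keep bothering you with my questions, but I need to get this solved...
I want to extract several substrings from a file that contains one long string, using a second file that holds the start and end of each substring I want to extract. The first file is: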
>scaffold30 24194
CTTAGCAGCAGCAGCAGCAGTGACTGAAGGAACTGAGAAAAAGAGCGAGCTGAAAGGAAGCATAGCCATTTGGGAGTGCCAGAGAGTTGGGAGG GAGGGAGGGCAGAGATGGAAGAAGAAAGGCAGAAATACAGGGAGATTGAGGATCACCAGGGAG.........
.................
(the string must be everything in the file except the first line), and the coordinates file looks like this:
44801988 44802104
44846151 44846312
45620133 45620274
45640443 45640543
45688249 45688358
45729531 45729658
45843362 45843490
46066894 46066996
46176337 46176464
.....................
My script looks like this:
my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];
#finds subsequences: fasta files
open INFILE1, $chrom or die "Could not open $chrom: $!";
my $count = 0;
while(<INFILE1>) {
    if ($_ !~ m/^>/) {
        local $/ = undef;
        my $var = <INFILE1>;
        open INFILE, $coords_file or die "Could not open $coords_file: $!";
        my @cline = <INFILE>;
        foreach my $cline (@cline) {
            print "$cline\n";
            my @data = split('\t', $cline);
            my $start = $data[0];
            my $end = $data[1];
            my $offset = $end - $start;
            $count++;
            my $sub = substr ($var, $start, $offset);
            print ">conserved $count\n";
            print "$sub\n";
        }
        close INFILE;
    }
}
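When I run it, it seems to do only one iteration, and it prints the beginning of the first file. It looks like the foreach loop isn't working, and substr doesn't seem to work either. When I add a print statement for cline to check the loop, it prints every line of the coordinates file.
I'm sorry if I'm being a nuisance, but I have to finish this and I'm getting a bit desperate...
Thanks again.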
Answer 0 (score: 2)
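This line
local $/ = undef;
changes $/ for the entire enclosing block, which includes the part where you read in the second file. $/ is the input record separator, which essentially defines what a "line" is (by default it is a newline; see perldoc perlvar for details). When you read from a filehandle with <>, $/ determines where reading stops. For example, the following program relies on the default line-splitting behavior, so it only reads up to the first newline: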
use feature 'say';    # say is not enabled by default

my $foo = <DATA>;
say $foo;
# Output:
# 1
__DATA__
1
2
3
whereas this program reads all the way to EOF:
use feature 'say';    # say is not enabled by default

local $/;
my $foo = <DATA>;
say $foo;
# Output:
# 1
# 2
# 3
__DATA__
1
2
3
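This means that your @cline array gets only one element: a single string containing the entire text of your coordinates file. You can verify this with Data::Dumper: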
use Data::Dumper;
print Dumper(\@cline);
which in your case would output something like this:
$VAR1 = [
'44801988 44802104
44846151 44846312
45620133 45620274
45640443 45640543
45688249 45688358
45729531 45729658
45843362 45843490
46066894 46066996
46176337 46176464
'
];
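Notice that the array delimited by [ and ] (technically an arrayref in this case) contains only a single element, which is one string (delimited by single quotes) with embedded newlines.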
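To see the effect in isolation, here is a minimal sketch that reads the same text in list context twice, once with the default $/ and once with $/ undefined. The in-memory filehandle and the read_records helper are my own stand-ins for illustration, not part of your script:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the coordinates file (three lines).
my $coords = "44801988 44802104\n44846151 44846312\n45620133 45620274\n";

# Read every record from an in-memory filehandle in list context.
# read_records is a hypothetical helper used only for this demonstration.
sub read_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Could not open in-memory file: $!";
    my @records = <$fh>;
    close $fh;
    return @records;
}

# Default $/ ("\n"): one element per line.
my @by_line = read_records($coords);

# $/ is undef only inside this do block; `local` is dynamically scoped,
# so the readline inside read_records sees it too.
my @slurped = do {
    local $/;
    read_records($coords);
};

printf "default \$/: %d element(s)\n", scalar @by_line;
printf "undef \$/:   %d element(s)\n", scalar @slurped;
```

The second count is 1, which is why a foreach over such an array runs exactly once.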
Let's walk through the relevant part of your code:
while(<INFILE1>) {
    if ($_ !~ m/^>/) {
        # Enable localized slurp mode. Stays in effect until we leave the 'if'
        local $/ = undef;
        # Read the rest of INFILE1 into $var (from current line to EOF)
        my $var = <INFILE1>;
        open INFILE, $coords_file or die "Could not open $coords_file: $!";
        # In list context, return each block until the $/ character as a
        # separate list element. Since $/ is still undef, this will read
        # everything until EOF into our first list element, resulting in
        # a one-element array
        my @cline = <INFILE>;
        # Since @cline only has one element, the loop only has one iteration
        foreach my $cline (@cline) {
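If you do want to slurp the sequence, one common way to keep the change from leaking is to wrap local $/ in a small do block. The following is only a sketch under my own assumptions: the two in-memory strings stand in for your files, and the variable names are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the two input files.
my $fasta_text  = ">scaffold30 24194\nCTTAGCAGCA\nGCAGCAGTGA\n";
my $coords_text = "2 5\n11 14\n";

open my $fasta_fh,  '<', \$fasta_text  or die "Could not open FASTA data: $!";
open my $coords_fh, '<', \$coords_text or die "Could not open coordinates: $!";

# Read the '>' header line with the default $/.
my $header = <$fasta_fh>;

# Slurp the remainder; $/ is undef only until the do block's closing brace.
my $sequence = do { local $/; <$fasta_fh> };

# Drop line breaks so that offsets count bases only.
$sequence =~ tr/\r\n//d;

# $/ is back to "\n" here, so list context yields one element per line.
my @coord_lines = <$coords_fh>;

printf "sequence length:  %d\n", length $sequence;
printf "coordinate lines: %d\n", scalar @coord_lines;
```

Because $/ is restored as soon as the do block ends, the later read of the coordinates splits on newlines again.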
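As a side note, your code could be cleaned up a bit. The names you chose for your filehandles leave something to be desired, and you should use lexical filehandles (and the three-argument form of open):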
open my $chromosome_fh, "<", $ARGV[0] or die $!;
open my $coordinates_fh, "<", $ARGV[1] or die $!;
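Also, you don't need nested loops here; they only make your code more complicated. First, read the relevant part of the chromosome file into a variable (with a more meaningful name than var):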
# Get rid of the `local $/` statement, we don't need it
my $chromosome;
while (<$chromosome_fh>) {
    next if /^>/;
    chomp;    # drop the newline so substr offsets count only bases
    $chromosome .= $_;
}
Then read in your coordinates file:
my @cline = <$coordinates_fh>;
Or, if you only need to go through the contents of the coordinates file once, process each line with a while loop:
while (<$coordinates_fh>) {
    # Do something for each line here
}
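As a rough illustration of what "do something for each line" could look like, here is a sketch that splits each line on whitespace and skips blank or malformed lines. The parse_coords helper and the sample data (including the filler line of dots) are assumptions of mine, not something from your files:

```perl
use strict;
use warnings;

# Parse one coordinates line into (start, end). Returns an empty list for
# blank or malformed lines. parse_coords is a hypothetical helper name.
sub parse_coords {
    my ($line) = @_;
    # split ' ' splits on any run of whitespace (spaces or tabs) and
    # ignores leading whitespace and the trailing newline.
    my ($start, $end) = split ' ', $line;
    return unless defined $end;
    return unless $start =~ /^\d+$/ && $end =~ /^\d+$/;
    return ($start, $end);
}

# Hypothetical sample input: a valid line, a blank line, a tab-separated
# line, and a line of filler dots.
my $coords_text = "44801988 44802104\n\n44846151\t44846312\n.....\n";
open my $coordinates_fh, '<', \$coords_text
    or die "Could not open coordinates: $!";

my @pairs;
while (my $line = <$coordinates_fh>) {
    # List assignment in scalar context counts the returned values,
    # so an empty list skips the line.
    my ($start, $end) = parse_coords($line) or next;
    push @pairs, [$start, $end];
}
close $coordinates_fh;

printf "%d..%d\n", @$_ for @pairs;
```

Splitting on ' ' rather than '\t' means the loop works whether the columns are separated by tabs or by spaces.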
Answer 1 (score: 1)
As ThisSuitIsBlackNot suggested, your code could be cleaned up a bit. Here is one possible solution that may do what you want.
#!/usr/bin/perl
use strict;
use warnings;
my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];
#finds subsequences: fasta files
open INFILE1, $chrom or die "Could not open $chrom: $!";
my $fasta;
<INFILE1>; # get rid of the first line - '>scaffold30 24194'
while(<INFILE1>) {
    chomp;
    $fasta .= $_;
}
close INFILE1 or die "Could not close '$chrom'. $!";
open INFILE, $coords_file or die "Could not open $coords_file: $!";
my $count = 0;
while(<INFILE>) {
    my ($start, $end) = split;
    # Or, should this be: my $offset = $end - ($start - 1);
    # That would include the start fasta
    my $offset = $end - $start;
    $count++;
    my $sub = substr ($fasta, $start, $offset);
    print ">conserved $count\n";
    print "$sub\n";
}
close INFILE or die "Could not close '$coords_file'. $!";
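On the open question in the comment above: substr counts positions from 0, so the answer depends on which convention the coordinates file uses, and the question does not say. The sketch below assumes 1-based coordinates that are inclusive at both ends, purely for illustration; extract_region is a hypothetical helper name:

```perl
use strict;
use warnings;

# Extract positions $start..$end, counted from 1 and inclusive at both ends.
# That convention is an assumption; adjust if your coordinates differ.
sub extract_region {
    my ($sequence, $start, $end) = @_;
    return substr($sequence, $start - 1, $end - $start + 1);
}

my $sequence = 'ACGTACGTAC';

# Passing the coordinates 3 and 6 through unchanged, as in the loop above,
# starts one base late and returns one character fewer.
my $raw       = substr($sequence, 3, 6 - 3);
my $inclusive = extract_region($sequence, 3, 6);

print "raw substr:        $raw\n";
print "1-based inclusive: $inclusive\n";
```

With the sample string, the unchanged call gives TAC and the adjusted one gives GTAC, so it is worth checking a region whose sequence you already know before running the whole file.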