Debugging a Perl program: hashes, and writing to and accessing filehandles

Date: 2012-07-14 02:46:40

Tags: perl debugging hash bioinformatics filehandle

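I am working on a bioinformatics project that involves combining different scripts and input parameters to analyze next-generation sequencing Illumina data. I need help debugging the first script. Its job is to parse a qseq file, convert the 'good' samples to fastq format, and save the output to a temporary txt file (on disk).

For debugging purposes, the pipeline looks like this: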

# the input parameters are "fed" into the script and the output is written
# to tmp.txt
script01.pl [input parameters] > tmp.txt

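I type this command in the terminal and then look at tmp.txt to check whether the script produced the expected result. For the whole project, though, I have what is called a wrapper script, which chains all the scripts together. Script01 will save its output as tmp files for the end1 and end2 data, because each needs to be processed separately by scripts 02-06.

Here is the code. I added descriptive comments so you can follow what is happening. Also, I don't have a qseq file to show you; just assume I am parsing the fields correctly: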

#!/usr/bin/perl
use strict; use warnings;

my $end1_temp=shift; # this variable is the location of the temporary file
my $end2_temp=shift; # this variable is the location of the temporary file
my $qseq_file=shift;
my $barcode=shift;

# declare and initialize variables
my $overhang= 'C[AT]GC';
my $end1_trim_offset= length($barcode)+4;
my $end2_trim_offset= 4; 

my %to_keep=(); # an empty hash
my @line=();    # an empty array

# open the qseq file, or exit
open (QSEQ_1_FILE, $qseq_file) or die "couldn't open $qseq_file for read\n";

# also, open (for output) the end1 temp file, so we can write to it while
# we process the end1 input file above
(open END_1_FILEOUT,">$end1_temp") or die "couldn't open $end1_temp for write\n";

# reads each line of the qseq data file one at a time. 
# assume each sample is kept on a separate line
while(<QSEQ_1_FILE>){
chomp; @line=split; # index and formats the qseq fields for parsing

# skip samples that didn't pass QC
$line[10]>0 or next;

# process the samples here. look at only the samples that pass the quality 
# control and whose sequences contain the barcode+overhang at the beginning
# of the string, otherwise skip to the next sample in the data file 
# (i.e start the next iteration of the loop)

# trim the barcode+overhang from seq and qual
$line[8]= substr($line[8], $end1_trim_offset); #sequence
$line[9]= substr($line[9], $end1_trim_offset); #quality

# unique sequence identifier. don't worry about what $line[1-6] represent
# just know that $identifier is unique for each of the 'good' samples
my $identifier = $line[0].'-'.$line[1].'-'.$line[2].'_'.$line[3].'-'.$line[4].'-'.$line[5].'_#'.$line[6];

# store the identifiers of the 'good' samples in a hash.
# the hash should contain the identifier as the key and numbers (1,2,3,etc.)
# as the values. the following increments the hash values for each identifier.
$to_keep{$identifier}++;

# the following is supposed to write information to the filehandle in fastq format:
# @[the identifier]/1
# sequence [ATGCAGTAAT...]
# +[the identifier]/1
# quality [ASCII characters]

print END_1_FILEOUT '@' . "$identifier/1\n$line[8]\n" . '+' . "$identifier/1\n$line[9]\n";

}
print "Found " . int(keys %to_keep) . " reads from end1\n";
# close the filehandles
close QSEQ_1_FILE; close END_1_FILEOUT;

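At this point, I have a hash that contains only the identifiers of the 'good' sequences, and the fastq data has been written to wherever it is stored. Script01 is supposed to save the fastq output to temporary files on disk.

$end1_temp = '~/tmp/sampleD1_end1.fq'
$end2_temp = '~/tmp/sampleD1_end2.fq'

Question: does the print END_1_FILE line above write the fastq data to the filehandle or to the $end1_temp variable? I ask because I need to pass the $end1_temp and $end2_temp variables to script02. Also, for debugging, how can I view the fastq output from script01?

Here is the rest of the code I need help with. It is in the same script and directly follows the code above: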

# if the sequence is paired (has both end1 and end2 data), then the qseq_2 file exists and the conditions evaluates to true
if ($end2_temp) {
# changes the qseq file name from the end1 file to the end2 file
# don't worry about why it works
$qseq_file=~ s/'_1_'/'_2_'/;

open (QSEQ_2_FILE, $qseq_file) or die "could not open $qseq_file for read\n";

# open (for output) the end2 temp file, so we can write to it while we process
# the end2 input file above
open (END_2_FILEOUT, ">$end2_temp") or die "could not open $end2_temp for write\n";

# reads each line of the end2 file one at a time
while(<QSEQ_2_FILE>){
# skip comments
/^\#/ and next;

chomp; #keep the qseq fields on one line.
my @line= split; #indexes the qseq fields.

# unique sequence identifier that preserves the sequencing information
# in other words, samples from end1 and end2 will have the same unique
# identifier because they contain the exact same fields in columns 0-6 
# or $line[0-6]. also end1 and end2 have the same number of samples
my $identifier = $line[0] .'-'.$line[1].'-'.$line[2].'_'.$line[3].'-'.$line[4].'-'.$line[5].'_#'.$line[6];

# recall the %to_keep hash which stored the identifiers of the 'good' samples; 
# it was inside the QSEQ_1_FILE loop
# here I only want the end2 samples whose identifiers match the identifiers
# from the end1 samples
# skip sample where end1 didn't pass; does this work??
# the condition is supposed to evaluate to false if the identifiers don't match
$to_keep{$identifier} or next;

# trim the barcode+overhang from seq and qual
$line[8]=substr($line[8], $end2_trim_offset);  # sequence
$line[9]=substr($line[9], $end2_trim_offset);  # quality

print END_2_FILEOUT '@' . "$identifier/2\n$line[8]\n" . '+' . "$identifier/2\n$line[9]\n";

}

close QSEQ_2_FILE; close END_2_FILEOUT;

}

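This is the end of script01. At this point, I should have the fastq data for the 'good' samples from end1 and end2 written to separate storage locations. Script01 is supposed to save the fastq output to two temporary files on disk. I guess my question is: for debugging, how can I view the tmp files that script01 creates?

Finally, when I enter the command script01.pl [input parameters] > tmp.txt in the Linux terminal, it saves script01's output to tmp.txt. "Found X reads from end1" is what the script prints after processing the end1 reads, where X is the number of reads in the %to_keep hash.

When I look at tmp.txt, it shows Found 0 reads from end1. Since it prints 0, nothing was stored in the hash. It is supposed to store about 6.3 million reads from end1. Can someone help me figure out why no reads are being stored in the hash?

I think the problem is that no reads pass the criteria I am using to decide whether they should be stored in the hash. Another problem could be how I am storing the identifiers.

Could you take a look and see whether there is anything I might have missed?

Thanks. Any suggestions or answers to my question are much appreciated.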

1 Answer:

Answer 0: (score: 1)

  

"Question: does the print END_1_FILE line above write the fastq data to the filehandle or to the $end1_temp variable?"

I think you mean END_1_FILEOUT. That is a writable filehandle on a file whose name is stored in $end1_temp. The data will be sent to the file.
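A minimal sketch of that distinction, assuming a throwaway file created with File::Temp (the file and its contents are illustrative and not taken from the question): the variable holds only the file name, and the handle is what print writes through.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for $end1_temp: the name of a temporary file.
my ($tmp_fh, $end1_temp) = tempfile(UNLINK => 1);
close $tmp_fh;

# Open a lexical filehandle on that name and write one made-up record.
open my $end1, '>', $end1_temp
    or die "couldn't open $end1_temp for write: $!\n";
print $end1 "\@id/1\nACGT\n+id/1\nIIII\n";
close $end1;

# The variable still holds only the path; the data is in the file.
open my $in, '<', $end1_temp
    or die "couldn't open $end1_temp for read: $!\n";
my @records = <$in>;
close $in;

print "path: $end1_temp\n";
print "lines written: ", scalar(@records), "\n";
```

Passing $end1_temp to a later script therefore passes the file's location, which that script can open for reading in the same way.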

  

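"Also, for debugging, how can I view the fastq output from script01?"

You look at the file. From your question, I gather the file is named ~/tmp/sampleD1_end1.fq. That is where you should look.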

  

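"Script01 is supposed to save the fastq output to two temporary files on disk. I guess my question is: for debugging, how can I view the tmp files that script01 creates?"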

As before, the output should be in the files whose names are in $end1_temp and $end2_temp, presumably ~/tmp/sampleD1_end1.fq and ~/tmp/sampleD1_end2.fq. Open them with your editor and take a look.
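If you would rather check from Perl than from an editor, something along these lines prints the first fastq record of each file named on the command line. The helper name first_record is my own invention. One thing to keep in mind: Perl's open does not expand a leading ~ by itself, so the sketch expands it with glob before opening (this assumes the path contains no spaces or wildcard characters).

```perl
use strict;
use warnings;

# Hypothetical helper: return the first fastq record (four lines) of a file.
sub first_record {
    my ($path) = @_;

    # open() does not expand "~", so expand it here if present.
    ($path) = glob($path) if $path =~ /^~/;

    open my $fh, '<', $path
        or die "couldn't open $path for read: $!\n";

    my @record;
    while (my $line = <$fh>) {
        push @record, $line;
        last if @record == 4;
    }
    close $fh;

    return @record;
}

# Usage (file name assumed): perl peek.pl ~/tmp/sampleD1_end1.fq
print first_record($_) for @ARGV;
```

From the shell, head -n 4 on the same file shows the same four lines.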

  

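"Finally, when I enter the command script01.pl [input parameters] > tmp.txt in the Linux terminal, it saves script01's output to tmp.txt. When I look at tmp.txt, it shows 'Found 0 reads from end1'. Since it prints 0, nothing is in the hash. It is supposed to store about 6.3 million reads from end1. Can someone help me figure out why no reads are being stored in the hash?"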

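This is basic debugging. It is clear from your code that either the qseq file you are reading is empty, or your test $line[10] > 0 or next is discarding every record. Both are easy to check by adding a diagnostic print statement directly after the split to show the value of $line[10]. My bet is that you are looking at the wrong field.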
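As a rough sketch of such a diagnostic, assuming a made-up whitespace-separated record (I have not seen your real file, so the sample below is purely illustrative), you can dump every field together with its index. The helper name field_report is hypothetical. It writes with warn, so the output goes to STDERR and does not end up in tmp.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: describe each field produced by splitting a record
# on whitespace, as "[index] value".
sub field_report {
    my ($record) = @_;
    my @line = split ' ', $record;
    return map { sprintf "[%d] %s", $_, $line[$_] } 0 .. $#line;
}

# Made-up record; the fields in a real qseq file may differ.
my $sample = join "\t",
    'M1', 42, 1, 1, 100, 200, 0, 1, 'ACGTACGTAC', 'IIIIIIIIII', 1;

# Diagnostics go to STDERR so they stay out of redirected STDOUT.
warn "$_\n" for field_report($sample);
```

In the real loop you would call this on the first few lines only. If nothing is printed at all, the file is empty; if the value at index 10 is not what you expect, the test is looking at the wrong field.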

Beyond that, you should indent your code. That will help us understand it far better than extensive comments. Also,

$line[10] > 0 or next

is better written as

next unless $line[10] > 0

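You should use the three-argument form of open and lexical filehandles.

Here is a tidied version of your code that includes these improvements and a few more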

use strict;
use warnings;

die qq(Insufficient arguments "@ARGV") unless @ARGV >= 4;

my ($end1_temp, $end2_temp, $qseq1_file, $barcode) = @ARGV;

my $overhang         = 'C[AT]GC';
my $end1_trim_offset = length($barcode) + 4;
my $end2_trim_offset = 4;

my $idformat = '%s-%s-%s_%s-%s-%s_#%s';

my %to_keep;

open my $qsec1, '<', $qseq1_file
    or die "couldn't open $qseq1_file for read\n";

open my $end1, '>', $end1_temp
    or die "couldn't open $end1_temp for write\n";

while (<$qsec1>) {

  my @line = split;
  next unless $line[10] > 0;

  $_ = substr($_, $end1_trim_offset) for @line[8,9];

  my $identifier = sprintf $idformat, @line;
  $to_keep{$identifier}++;

  print $end1
      '@' . "$identifier/1\n" .
      "$line[8]\n" .
      '+' . "$identifier/1\n" .
      "$line[9]\n";
}

close $qsec1;
close $end1;

printf "Found %d reads from end1\n", int keys %to_keep;


exit unless $end2_temp;

$qseq1_file =~ s/'_1_'/'_2_'/;

open my $qseq2, '<', $qseq1_file
    or die "could not open $qseq1_file for read\n";

open my $end2, '>', $end2_temp
    or die "could not open $end2_temp for write\n";

while (<$qseq2>) {

  next if /^#/;
  my @line = split;

  my $identifier = sprintf $idformat, @line;
  next unless $to_keep{$identifier};

  $_ = substr($_, $end2_trim_offset) for @line[8,9];

  print $end2
      '@' . "$identifier/2\n" .
      "$line[8]\n" .
      '+' . "$identifier/2\n" .
      "$line[9]\n";
}

close $qseq2;
close $end2;
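One detail of the tidied version worth checking for yourself is that the sprintf format builds the same hash key as the original chain of concatenations. The sketch below uses made-up field values. It also slices @line to the seven fields the format consumes, because, as far as I know, newer Perls (5.22 and later) can warn about redundant sprintf arguments under use warnings when the whole array is passed.

```perl
use strict;
use warnings;

my $idformat = '%s-%s-%s_%s-%s-%s_#%s';

# Made-up fields standing in for one split qseq line.
my @line = ('M1', 42, 1, 1, 100, 200, 0, 1, 'ACGT', 'IIII', 1);

# The identifier as built in the question.
my $by_concat = $line[0] . '-' . $line[1] . '-' . $line[2] . '_'
              . $line[3] . '-' . $line[4] . '-' . $line[5] . '_#' . $line[6];

# The identifier as built with the format, using only fields 0 to 6.
my $by_sprintf = sprintf $idformat, @line[0 .. 6];

print "$by_concat\n";
print $by_concat eq $by_sprintf ? "same key\n" : "different keys\n";
```

Because both forms give the same string, a key stored in %to_keep while reading end1 will match the key computed for the corresponding end2 record.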