Reading .fasta sequences to extract nucleotide data, then writing to a tab-delimited file

Time: 2012-03-17 09:34:02

Tags: perl bioinformatics base sequences fasta

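Before I go on, I thought I'd point readers to my previous Perl questions, since I'm a beginner at all of this.

These are my posts from the past few days, in chronological order: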

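  1. How do I average column values from a tab-separated data... (solved)
  2. Why do I see no computed results in my output file? (solved)
  3. Using a .fasta file to compute relative content of sequences

As stated above, thanks to help from some of you I've managed to work out the first two, and I really learned from them. I really appreciate it. For someone who knew nothing about this, and still feels he doesn't, the help has been a godsend.

The last query is still unresolved, and this is a continuation of it. I did look at some of the recommended texts, but since I'd like to finish this by Monday, I'm not sure whether I've overlooked something entirely. Either way, I've made an attempt at the task.

As you know, the task is to open and read a .fasta file (I think I've finally nailed that, hallelujah!), read each sequence, compute its relative G+C nucleotide content, and then write the gene names and their respective G+C content to a tab-delimited file.

Even though I've made an attempt, I know I'm not yet able to get the program to produce the result I'm after, which is why I'm reaching out to you again for some guidance, or an example of how to approach this. As with my earlier solved queries, I'd like the solution to be similar to what I've already done, even if that isn't the most convenient or efficient way. That just lets me understand what each step is doing, even if it looks like I'm spamming!

Anyway, the contents of the .fasta file look like this: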

    >label
    sequence
    >label
    sequence
    >label
    sequence
    

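I'm not sure how to open a .fasta file to look at it, so I don't know which label goes with which sequence, but I do know the genes should be labelled gag, pol and env. Do I need to open the .fasta file to know what I'm doing, or can I do this "blind" using the format above?

This may be really obvious, but I'm still struggling with all of this. I feel I should have got the hang of it by now!

Anyway, my existing code is as follows: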

    #!/usr/bin/perl -w
    # This script reads several sequences and computes the relative content of G+C of each sequence.
    use strict; 
    
    my $infile = "Lab1_seq.fasta";                               # This is the file path
    open INFILE, $infile or die "Can't open $infile: $!";        # This opens file, but if file isn't there it mentions this will not open
    my $outfile = "Lab1_SeqOutput.txt";             # This is the file's output
    open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open
    
    my $sequence = ();  # This sequence variable stores the sequences from the .fasta file
    my $GC = 0;         # This variable checks for G + C content
    
    my $line;                             # This reads the input file one-line-at-a-time
    while ($line = <INFILE>) {
        chomp $line;                      # This removes "\n" at the end of each line (this is invisible)
    
        foreach my $line ($infile) {
            if($line = ~/^\s*$/) {         # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
                next;
            } elsif($line = ~/^\s*#/) {        # This finds lines with spaces before the hash character. Removes .fasta comment
                next; 
            } elsif($line = ~/^>/) {           # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
                next;
            } else {
                $sequence = $line;
            }
        }
        {
            $sequence =~ s/\s//g;               # Whitespace characters are removed
            return $sequence;
        }
    

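I'm not sure what is correct here, but running it leaves me with a syntax error at line 35 (which is past the last line, so there's nothing there!). It says "at EOF". That's all I can point to. Beyond that, I'm trying to work out how to count the G+C nucleotides in each sequence and then tabulate them properly in the output .txt file. I believe that's what a tab-delimited file means?

Anyway, I apologise if this query is overly long, "dumb" or repetitive, but that said, I couldn't find anything directly related to it, so any help would be much appreciated, along with an explanation of each step if possible!

Kind regards.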

1 answer:

Answer 0 (score: 2)

You have an extra brace at the end. This should work:

#!/usr/bin/perl -w
# This script reads several sequences and computes the relative content of G+C of each sequence.

use strict; 

my $infile = "Lab1_seq.fasta";                               # This is the file path
open INFILE, $infile or die "Can't open $infile: $!";        # This opens file, but if file isn't there it mentions this will not open
my $outfile = "Lab1_SeqOutput.txt";             # This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open

my $sequence = ();  # This sequence variable stores the sequences from the .fasta file
my $GC = 0;         # This variable checks for G + C content

my $line;                             # This reads the input file one-line-at-a-time

while ($line = <INFILE>) {
    chomp $line;                      # This removes "\n" at the end of each line (this is invisible)

    if($line =~ /^\s*$/) {         # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
        next;

    } elsif($line =~ /^\s*#/) {        # This finds lines with spaces before the hash character. Removes .fasta comment
        next; 
    } elsif($line =~ /^>/) {           # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
        next;
    } else {
        $sequence = $line;
    }

    $sequence =~ s/\s//g;               # Whitespace characters are removed
    print OUTFILE $sequence;
}

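I also edited your return line. A return would exit the loop. I suspect what you want is to print the sequence to the file, so that's what I've done. You may need to do some further transformation first to get it into tab-delimited format.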
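
One possible way to do that further transformation is sketched below. This is not the answerer's code. It assumes each record's label is the first word after `>` and that a sequence may span several lines. The sub names `gc_fraction` and `read_fasta`, the column headers, and the in-memory sample data are all made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Fraction of G and C bases in a sequence (case-insensitive); 0 for an empty one.
sub gc_fraction {
    my ($seq) = @_;
    return 0 unless length $seq;
    my $gc = ($seq =~ tr/GCgc//);    # tr/// with an empty replacement only counts
    return $gc / length($seq);
}

# Read FASTA records from a filehandle; returns a list of [label, sequence] pairs.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;       # skip blank lines
        next if $line =~ /^\s*#/;       # skip '#' comment lines, as the script above does
        if ($line =~ /^>\s*(\S*)/) {    # header line: start a new record
            push @records, [$1, ''];
        } elsif (@records) {            # sequence line: append to the current record
            $line =~ s/\s+//g;
            $records[-1][1] .= $line;
        }
    }
    return @records;
}

# Hypothetical sample data standing in for Lab1_seq.fasta (assumption).
my $sample = ">gag\nATGGGT\nGCGAGA\n>pol\nTTTTAA\n>env\nGGCC\n";
open my $in, '<', \$sample or die "Cannot open in-memory sample: $!";

# One header row, then one tab-separated row per gene.
print join("\t", 'gene', 'gc_content'), "\n";
for my $rec (read_fasta($in)) {
    my ($label, $seq) = @$rec;
    printf "%s\t%.3f\n", $label, gc_fraction($seq);
}
close $in;
```

To run this against the real data, the in-memory handle could be swapped for `open my $in, '<', 'Lab1_seq.fasta'`, and the `print`/`printf` calls pointed at a handle opened with `open my $out, '>', 'Lab1_SeqOutput.txt'`. Collecting all lines of a record before counting matters because a sequence in a .fasta file is often wrapped over several lines.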