Merging Fasta and Qual files with different header order into FASTQ

Date: 2018-12-09 17:50:41

Tags: perl sorting bioinformatics fasta fastq

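I am trying to merge a fasta file and a qual file into a new fastq file, bearing in mind that the sequence IDs may appear in a different order in the two files. The first step of my script sorts the sequences, and it works fine when I test it as a standalone script. The same goes for the rest: when I run the part that merges the files into fastq on its own, it works perfectly. But now that I am trying to combine both steps in one script, it does not work, and I don't know what else to do. I would appreciate any help.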

Here is my script so far. It creates the new fastq file, but the content is scrambled and not what I want. I run it from the terminal like this:

$ perl script.pl reads.fasta reads.qual > reads.fq

Script:

#!/usr/bin/env perl
use strict;
use warnings;

die ("Usage: script.pl reads.fasta reads.qual > reads.fq") unless  (scalar @ARGV) == 2;

open FASTA, $ARGV[0] or die "cannot open fasta: $!\n";
open QUAL, $ARGV[1] or die "cannot open qual: $!\n";

my $offset = 33; 
my $count = 0;
local($/) = "\n>";

my %id2seq = ();
my $id = '';
my %idq2seq = ();
my $idq = '';
my (@sort_q, @sort_f);

while(<FASTA>){
    chomp;
        if($_ =~ /^>(.+)/){
            $id = $1;
        }else{
            $id2seq{$id} .= $_;
        }
     }

for $id (sort keys %id2seq)
    {
     @sort_f = "$id\n$id2seq{$id}\n\n";
     print @sort_f;
    }

while(<QUAL>){
chomp;
    if($_ =~ /^>(.+)/){
        $idq = $1;
    }else{
        $idq2seq{$idq} .= $_;
    }
}

for $idq (sort keys %idq2seq)
    {
    @sort_q = "$idq\n$idq2seq{$idq}\n\n";
    print "@sort_q";
    }

while (my @sort_f) {
chomp @sort_f;
my ($fid, @seq) = split "\n", @sort_f;   
my $seq = join "", @seq; $seq =~ s/\s//g;
my $sortq = @sort_q;
chomp my @sortq;
my ($qid, @qual) = split "\n", @sortq;

@qual = split /\s+/, (join( " ", @qual));
# convert score to character code:
my @qual2 = map {chr($_+$offset)} @qual;
my $quals = join "", @qual2;
die "missmatch of fasta and qual: '$fid' ne '$qid'" if $fid ne $qid;
$fid =~ s/^\>//;
print STDOUT (join( "\n", "@".$fid, $seq, "+$fid", $quals), "\n");
$count++;
}
close FASTA;
close QUAL;
print STDERR "wrote $count entries\n";

Thanks in advance.

1 answer:

Answer 0 (score: 0)

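It has been a while since I used Perl, but I would do this with hashes of key/value pairs for both the fasta and the quality input, then write out all the pairs by iterating over the fasta hash and pulling out the corresponding quality string.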
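As a minimal sketch of that idea: keying both inputs by header makes the order of records in each file irrelevant. The record values and the names `fasta_records`, `qual_records` and `pair_records` are made up for illustration.

```python
# Hypothetical in-memory records; the two inputs list the IDs in different orders.
fasta_records = {"fa_1": "CCCA", "fa_0": "GCAG"}
qual_records = {"fa_0": [59, 37, 38, 51], "fa_1": [36, 37, 47, 72]}


def pair_records(seqs, quals):
    """Return (header, sequence, scores) for every header present in both dicts."""
    return [(h, s, quals[h]) for h, s in seqs.items() if h in quals]


pairs = pair_records(fasta_records, qual_records)
for header, seq, scores in pairs:
    print(header, seq, scores)
```

Each sequence is matched to its quality scores by header lookup, so neither input needs to be sorted first.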

I have written something in Python that does what you need; you can see it in action here:

It assumes your input looks like this:
reads.fasta

>fa_0
GCAGCCTGGGACCCCTGTTGT
>fa_1
CCCACAAATCGCAGACACTGGTCGG

reads.qual

>fa_0
59 37 38 51 56 55 60 44 43 42 56 65 60 68 52 67 43 72 59 65 69
>fa_1
36 37 47 72 34 53 67 41 70 67 66 51 47 41 73 58 75 36 61 48 70 55 46 42 42

Output

@fa_0
GCAGCCTGGGACCCCTGTTGT
+
;%&387<,+*8A<D4C+H;AE
@fa_1
CCCACAAATCGCAGACACTGGTCGG
+
$%/H"5C)FCB3/)I:K$=0F7.**
@fa_2
TCGTACAGCAGCCATTTTCATAACCGAACATGACTC
+
C?&93A@:?@F,2:'KF*20CC:I7F9J.,:E8&?F

import sys


# Check there are enough arguments
if len(sys.argv) < 3:
    print('Usage: {s} reads.fasta reads.qual > reads.fq'.format(s=sys.argv[0]), file=sys.stderr)
    sys.exit(1)

# Initialise dictionaries for reads and qualities
read_dict = dict()
qual_dict = dict()

fa_input = sys.argv[1]
qual_input = sys.argv[2]

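# Read in fasta input (lines belonging to one record are concatenated)
with open(fa_input, 'r') as fa:
    header = None
    for line in fa:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            header = line[1:]
            read_dict[header] = ''
        elif header is not None:
            read_dict[header] += line

# Read in quality input (lines belonging to one record are joined by a space)
with open(qual_input, 'r') as qual:
    header = None
    for line in qual:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            header = line[1:]
            qual_dict[header] = ''
        elif header is not None:
            qual_dict[header] = (qual_dict[header] + ' ' + line).strip()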

count = 0
# Iterate over all fasta reads
for key, seq in read_dict.items():
    # Check if read header is in the qualities data
    if key in qual_dict.keys():
        # There's both sequence and quality data so write stdout
        read_str = '@{header}\n{seq}\n+\n{qual}'.format(
            header=key,
            seq=seq,
            qual=''.join([chr(int(x)) for x in qual_dict[key].split(' ')]))
        print(read_str, file=sys.stdout)
        count += 1
    else:  # not found
        # Write error to stderr
        print('Error: {k} not found in qual file'.format(k=key), file=sys.stderr)

# Print count to stderr
print('{c} reads written'.format(c=count), file=sys.stderr)

If you need to apply an offset to the quality scores, change
qual=''.join([chr(int(x)) for x in qual_dict[key].split(' ')]))
to
qual=''.join([chr(int(x) + offset) for x in qual_dict[key].split(' ')]))
and define an offset variable before that point.
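To make the offset concrete, here is a small sketch that assumes the common Phred+33 encoding, the same value as `$offset = 33` in the question's Perl script. The function name `encode_quals` is hypothetical and not part of the script above.

```python
def encode_quals(scores, offset=33):
    """Convert whitespace-separated numeric quality scores to a FASTQ quality string."""
    return ''.join(chr(int(x) + offset) for x in scores.split())


# With offset 33, the scores 0, 10, 20, 30 and 40 map to '!', '+', '5', '?' and 'I'.
encoded = encode_quals('0 10 20 30 40')
print(encoded)
```

Calling `split()` with no argument also tolerates repeated spaces or tabs between scores, which `split(' ')` does not.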