我有2个文件(paired_end),其中包含来自同一生物体的Fastq格式的读取(File_R1.fastq和File_R2.fastq)。我想确定的覆盖深使用BWA和samtools(生物信息学),为做到这一点,我需要相同数量的读取,相同的名称(file_1两个文件:@ XX00341:4450:6341 1:N:0:AACGTTAA + TTGCAATT和file_2:@ XX00341:4450:6341 2:N:0:AACGTTAA + TTGCAATT),在这两个文件相同的顺序,问题是,这两个文件具有空读取(相应的读取每个文件,可以在空一个但不在其他!!!!),如果我尝试用bwa运行它由于空序列而失败。
所以我正在尝试创建一个perl脚本来提取两个文件中具有相同名称和顺序的NO空读。
这是的fastq文件的格式(每次读取具有4行:名称(@ XX00341:19:000H27K25:1:11101:4450:6341 1:N ....),序列(GAGGTGCGTGGTTGTCACCCC ...), Q_head(+)和质量(FFFFFFFFFFFFFFF ....),按此顺序。
file_r1.fastq(特征为= .... 1:N:0:AACGTTAA + TTGCAATT)
@XX00341:4450:6341 1:N:0:AACGTTAA+TTGCAATT
CGCCCATCATGAGGTGCGTGGTTGTCACCCCCGCAAACGCGGAGGTGTAAACAGGCTCACCTGTGGGTTGTTGGTAGTCGTTATTGTGCTTGCGCTGTTCATCTGGATTACTGGATTCGACGCGTTGAGC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@XX00341:14420:6341 1:N:0:AACGTTAA+TTGCAATT
GCTTTGTCGTCGTCGGTTTTAAAGTGAACCGCTTTACCTGTTTCTGCTTGAACTTGTTCTGCTTGAGGTGCTGCTTCTGGTTTTGTTTCTTCTTTCTGACAACCTACCGCTAGCATTACTGTTGCAGCAA
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFFAAFFFFFFAFFFFAFFFFFFFFFFAFAFFFFFAFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFAFFFFFFFAFFFFFFF/FFFFFFFFFFFF
@XX00341:10259:6342 1:N:0:AACGTTAA+TTGCAATT
+
@XX00341:6685:6342 1:N:0:AACGTTAA+TTGCAATT
CATTGGAAGGCAAGCCAGAACAAGGCAAGAATATTCCAGATGACATCGTGCGCGTTCGCATCGATCGCAACAGCGGTCTGCTGACTCATAAAGTGGATAGCACGTCCATGTTTGAATACTTTGAGAAAGG
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF6FFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFAFFFFFFAFFFFFFFFFFFFF
@XX00341:4051:6343 1:N:0:AACGTTAA+TTGCAATT
TCACCATGATCGGATTTATGAATGGTTTAGTGGACAGCATGATCAAAAATGCGATTGCTTGGCAAACCAGCCATTTGCAGATTCATCAAAGTGCTTATCTCGTTAATCCTGAACTGAAAGACACCATACC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@XX00341:12307:6344 1:N:0:AACGTTAA+TTGCAATT
TCCAAATTGAGAGTTAATGTGAAAAAAGAATGACTTTTTGCTTGTCATGCAGCGGATTTGTGTGATACCACTAATGACGCAAACATTTTCGTAGAACATTACACAATACTATTTAACGAAAAAAGAACGA
+
FAAFFAFAAFFAFFF/FF/FFAFFFFFFFFFAAFFF=F=AFFFFFFFFFFFFAFFFFFFFFFFAFFFFFFFFFFAFFFFFAAFFFFFFFFFF/FAFFFFAFFFFFFFFFFFFAAAFFAF///FAFFFFFF
@XX00341:24250:6345 1:N:0:AACGTTAA+TTGCAATT
CGAACATAGAGCAAGCTCTGGTAAGTCCCGATGGGAGCTTAAAGACACAAGTAGACCAAAAGCTCGCAGAACATGGCTTGAGCCGTAAAGTCACCGTGGCGTCGCGCAACTTTCTCACCATTCGGCATCT
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
file_r2.fastq(特征为= .... 2:N:0:AACGTTAA + TTGCAATT)
@XX00341:4450:6341 2:N:0:AACGTTAA+TTGCAATT
GCGGCTTTGGTAGCGAAGCGCGTCTACCAACCGCAGCCATGAAACAACTGGCGTTTGAAGTGGAAAAAACGGCAGCGGGCAGTATTCCGGTACTGATTGAAGCCATCGAACAAGTGGCGGTGCCTACCGC
+
AFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@XX00341:14420:6341 2:N:0:AACGTTAA+TTGCAATT
GCGGTTCGAGTGGCCAACGTTGAACTTCATGCTCGGTAAAAAAGCAACCATTTAACGTGGTGATGACAATTAAATATAGGAATAAATGAGAAATTCTTTGCATCACATCTAATAATCCGGTTTGTGTCTT
+
AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFAFAFFFFFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFFF
@XX00341:10259:6342 2:N:0:AACGTTAA+TTGCAATT
TCTTTTTTCGTTAAATAGTATTGTGTAAT
+
FFFFFFFFFFFFFAFFFFFFFFFFFFFFF
@XX00341:6685:6342 2:N:0:AACGTTAA+TTGCAATT
GAATAAATGCTGTCTTCGAGGCTGTTACCAACGTATTCGGTTGGCTCCGTGCCTTTCTCAAAGTATTCAAACATGGACGTGCTATCCACTTTATGAGTCAGCAGACCGCTGTTGCGATCGATGCGAACGC
+
AFFFFFFFFAFFFFAFFFFFFFFFFFFFFFFFFFFFFFF/FF/FFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFAFFFF=FF=FFFFFFFAFFFFFFF
@XX00341:4051:6343 2:N:0:AACGTTAA+TTGCAATT
GAAGCTATCATGCCATCGGCGAGAAACCGCTCCGATACGGCTTTCACGCTTTGATGCTTGTCTAACGTTGTCACGATGCTTTGCGAGTCAGGTATGGTGTCTTTCAGTTCAGGATTAACGAGATAAGCAC
+
AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@XX00341:12307:6344 2:N:0:AACGTTAA+TTGCAATT
GTTCATCTGTCTGTCGTTCTTTTTTCGTTAAATAGTATTGTGTAATGTTCTACGAAAATGTTTGCGTCATTAGTGGTATCACACAAATCCGCTGCATGACAAGCAAAAAGTCATTCTTTTTTCACATTAA
+
AFFFFFFFFFAFFFFFFFFAFFFFFFFFFFFFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFF=FFFFFFFAFFFFFFFAFFFFFFFFFF=FAFFFFFFFFAFFFAFFAFFFFFFFFA=AFF6AFFAFAFF
@XX00341:24250:6345 2:N:0:AACGTTAA+TTGCAATT
CCATAGCATGGCAATATCAAAATCCGCAACGGCGATCGGCGGCTTAACAGTAATGAGATCGTCTGAAAAGCCCTCCGCCAACGCCATTTTTTCTGGCACGATGGCAATGAGATCGCGCTTTTTCAATAGA
+
AFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFF
在这种情况下,file_1.fastq中的一个读取为空(@ XX00341:10259:6342 1:N ....),第二行和第四行为空(序列和质量),但不在file_2中.fastq(@ XX00341:10259:6342 2:N ....);在这个例子中,两个序列(每个文件中的4行)必须省略!
这是我要完成的代码:
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
my ($fastQ_R1, $fastQ_R2);
GetOptions (
'R1|r1=s' =>\$fastQ_R1,
'R2|r2=s' =>\$fastQ_R2,
);
sub extract_list {
my ($file_in) = @_;
open FILE, '<', $file_in or die "cant open the $file_in\n";
my (@elements, @list, @seq);
while(
defined(my $head = <FILE>) && # 1 line
defined(my $seq = <FILE>) && # 2 line
defined(my $qhead = <FILE>) && # 3 line
defined(my $quality = <FILE>) # 4 line
){
if ($seq=~ m/^$/g){
next;
}
else { push @seq, $head; }
}
close FILE;
foreach (@seq){
chomp;
@elements = split '\s', $_;
push @list, $elements[0]; # split to eliminate 1:N... (file_1) and 2:N... (file_2)
}
return @list;
}
my @list_R1 = extract_list ($fastQ_R1);
my @list_R2 = extract_list ($fastQ_R2);
到目前为止,我有两个文件中没有空序列(@ list_R1和@ list_R2)的列表,两个数组的比较将给出两个文件中存在的空读(有4行!!)(@ common_elements)从两个数组的比较中获得。
@ list_R2 =
@XX00341:4450:6341
@XX00341:14420:6341
@XX00341:6685:6342
@XX00341:4051:6343
@XX00341:12307:6344
@XX00341:24250:6345
@ list_R2 =
@XX00341:4450:6341
@XX00341:14420:6341
@XX00341:10259:6342
@XX00341:6685:6342
@XX00341:4051:6343
@XX00341:12307:6344
@XX00341:24250:6345
@common_elemnts =
@XX00341:4450:6341
@XX00341:14420:6341
@XX00341:6685:6342
@XX00341:4051:6343
@XX00341:12307:6344
@XX00341:24250:6345
因此,新的阵列(@common_elemnts)将被用于搜索并提取公共读出(4线)的foreach文件并给丝束输出文件:Files_R1_common.fastq(来自File_R1.fastq获得)和File_R2_common.fastq(从File_R2.fastq获得)。
任何建议,将不胜感激!!!! 非常感谢
答案 0 :(得分:2)
您最好使用哈希映射检查列表中的项目是否存在 构造公共哈希映射:
my %list1_map = map { $_, 1 } @list_R1;
my %common_map = map { $list1_map{$_} ? ($_, 1) : () )} @list_R2;
然后处理您的列表:
for my $item(@list_R1) {
if (defined ($common_map{$item})) {
# ... process ...
}
答案 1 :(得分:1)
在下面的解决方案中,我将FASTQ解析移动到一个单独的类,该类将从文件中读取4行并返回id,hdr,seq和qual的hashref,表示一条记录。我已经把这堂课作为练习留给你了。这样做可以使下面的逻辑易于理解。可读性很重要。
此代码只是保存来自file1的空seqs的ID(即$ hdr =〜/ ^ @(\ S +)/中的$ 1)。然后它读取file2并输出完整记录,并保存空记录的ID。最后,您再次读取file1并通过删除%筛选哈希指示的那些输出完整记录。
这个解决方案效率很低,因为它读取file1两次。原始问题中没有提到的是FASTQ文件通常包含数百万条记录,因此在%过滤哈希中保存file1中所有未过滤的记录需要大量的RAM,但如果你有大量的RAM,你可以改为适合,但我个人不担心两次阅读文件。
# SAVE IDS OF EMPTY RECORDS FROM FILE1 IN HASH
my %filter;
my $fq = new Fastq($file1);
while (my $rec = nextSeq($fq))
{
$rec->{seq} or $filter{$rec->{id}} = undef; # not using value
}
# ADD IDS OF EMPTY RECORDS FROM FILE2 TO HASH AND OUTPUT FILTERED_FILE2
open(my $out, '>', "$file2.filtered") or die($!);
$fq = new Fastq($file2);
while (my $rec = nextSeq($fq) )
{
if ( $rec->{seq} )
{
# READ2 FOR THIS PAIR IS NOT EMPTY
if ( ! exists($filter{$rec->{id}}) )
{
# READ1 FOR THIS PAIR IS ALSO NOT EMPTY
print $out join("\n", $rec->{hdr}, $rec->{seq}, '+', $rec->{qual}), "\n";
}
} else
{
# READ2 IS EMPTY, ADD TO FILTERED HASH
$filtered{$rec->{id}} = undef;
}
}
close($out);
# OUTPUT FILTERED_FILE1
open($out, '>', "$file1.filtered") or die($!);
$fq = new Fastq($file1);
while (my $rec = nextSeq($fq) )
{
if ( $rec->{seq} and ! exists($filter{$rec->{id}}) )
{
print $out join("\n", $rec->{hdr}, $rec->{seq}, '+', $rec->{qual}), "\n";
}
}
close($out);