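I have over 200 multi-sequence FASTA files, and each file contains sequences of a selected gene (e.g. PF3D7_1467550 in the sample input FASTA file) for several hundred samples. Most samples in a FASTA file (e.g. sample 303.1, the first sequence in the sample input file) have a single sequence, but other samples (e.g. IGS-MLW-089sA and IGS-MWI-254sA) have multiple sequences of the gene that need to be concatenated.
Sample input FASTA file: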
>303.1_assembled_PF3D7_1475500.[1:126].sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
EEKKTYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVI
KNVQ
>IGS-MLW-089sA_assembled_PF3D7_1475500.[1:61].sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
>IGS-MLW-089sA_assembled_PF3D7_1475500.[65:126].sp.tr
TYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVIKNVQ
>IGS-MWI-254sA_assembled_PF3D7_1475500.[1:61].sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
>IGS-MWI-254sA_assembled_PF3D7_1475500.[65:119].sp.tr
TYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDC
Desired output:
>303.1_assembled_PF3D7_1475500.[1:126].sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
EEKKTYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVI
KNVQ
>IGS-MLW-089sA_assembled_PF3D7_1475500.[1:61][65:126].sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
TYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVIKNVQ
>IGS-MWI-254sA_assembled_PF3D7_1475500.[1:61][65:119].sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
TYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDC
I believe the code from another question may be useful.
%hash;
while (<DATA>) {
    if (/^>(miRNA\d+)/) {
        $hash{$1}[0] = $_;
        chomp($n = <DATA>);
        unshift @{$hash{$1}[1]}, $n;
    }
}
for $k (sort keys %hash) {
    print $hash{$k}[0], join(',', @{$hash{$k}[1]}), "\n";
}
Here is the link to the previous question:
I need search a pattern in a header line of my file and concatenates the next line with Perl
I am looking for help modifying the following part of the code, which handles selecting the sample ID, or for alternative suggestions.
/^>(miRNA\d+)/
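Thanks
Answer 0 (score: 3)
If the samples to be concatenated are adjacent, you can simply collect the ranges (e.g. [1:61]) and the lines to print into two arrays.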
#!/usr/bin/perl
use warnings;
use strict;

# Return the header with every [x:y] range removed, so that headers
# of the same sample compare equal.
sub without_ranges {
    my ($header) = @_;
    ( my $without = $header ) =~ s/\[[^\]]+\]//g;
    return $without;
}

# Print the header with the collected extra ranges appended after its
# last range, followed by the buffered sequence lines.
sub output {
    my ($header, $ranges, $buffer) = @_;
    my $header_with_ranges = $header;
    $header_with_ranges =~ s/(.*\])/$1\[$_]/ for @$ranges;
    print $header_with_ranges, @$buffer;
}

my (@buffer, @ranges);
my $header = "";
while (<>) {
    if (/^>/) {
        my $new_header = $_;
        if (without_ranges($new_header) eq without_ranges($header)) {
            # Same sample as the previous record: remember only its range.
            push @ranges, $new_header =~ /\[([^\]]+)\]/;
        } else {
            # New sample: flush the previous one and start over.
            output($header, \@ranges, \@buffer) if $header;
            $header = $new_header;
            @buffer = @ranges = ();
        }
        last if eof;
    } else {
        push @buffer, $_;
    }
}
output($header, \@ranges, \@buffer);
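To see what the two regular expressions in the script do to a header, here is a minimal sketch that applies them to one of the sample headers. The helper name `split_header` is hypothetical and not part of the answer; the sketch assumes every header contains exactly one `[x:y]` range.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): split a FASTA header into
# the header without its range and the range itself.
# Assumes exactly one [x:y] range per header.
sub split_header {
    my ($header) = @_;
    chomp $header;
    ( my $key = $header ) =~ s/\[[^\]]+\]//g;    # same substitution as without_ranges
    my ($range) = $header =~ /\[([^\]]+)\]/;     # same match used to collect ranges
    return ( $key, $range );
}

my ( $key, $range ) =
    split_header(">IGS-MLW-089sA_assembled_PF3D7_1475500.[65:126].sp.tr\n");
print "$key\n";      # >IGS-MLW-089sA_assembled_PF3D7_1475500..sp.tr
print "$range\n";    # 65:126
```

Two headers belong to the same sample exactly when their keys compare equal, which is the test the script uses to decide whether to append a range or to start a new record.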
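Answer 1 (score: -1)
The code from the other question is not that useful, and frankly it is a bit... less than ideal.
Here is one possible solution, assuming you always have a range [x:y].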
use strict; use warnings;
my (%hash, $key, $start);
while (<DATA>) {
    chomp;
    if (m{^(>.*?)(?:\[(\d+):(\d+)\]\.sp\.tr)?$}) {
        ($key, $start) = ($1, $2);
        next;
    }
    $hash{$key}{$start} .= $_;
}
for my $key (sort keys %hash) {
    my $keyref = $hash{$key};
    printf "%ssp.tr\n%s\n", $key,
        join('', map { $keyref->{$_} } sort { $a <=> $b } keys %$keyref);
}
__DATA__
>303.1_assembled_PF3D7_1475500.[1:126].sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
EEKKTYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVI
KNVQ
>IGS-MLW-089sA_assembled_PF3D7_1475500.[1:61].sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
>IGS-MLW-089sA_assembled_PF3D7_1475500.[65:126].sp.tr
TYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVIKNVQ
>IGS-MWI-254sA_assembled_PF3D7_1475500.[1:61].sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY
>IGS-MWI-254sA_assembled_PF3D7_1475500.[65:119].sp.tr
TYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDC
Output:
>303.1_assembled_PF3D7_1475500.sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNYEEKKTYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVIKNVQ
>IGS-MLW-089sA_assembled_PF3D7_1475500.sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNYTYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVIKNVQ
>IGS-MWI-254sA_assembled_PF3D7_1475500.sp.tr
MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNYTYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDC
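The script in this answer drops the ranges from the merged headers, whereas the desired output in the question keeps them (for example `[1:61][65:126]`). A minimal sketch of a variant that keeps them follows. It is not the answerer's code: the function name `merge_fasta` is hypothetical, and it assumes every header has the form `>name.[x:y].sp.tr`. Fragments of the same sample do not have to be adjacent, and the original line breaks of each sequence are preserved.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: merge FASTA records of the same sample and keep
# every range in the merged header.
# Assumes each header looks like ">name.[x:y].sp.tr".
sub merge_fasta {
    my (@lines) = @_;
    my ( %frag, @order, $key, $start );
    for my $line (@lines) {
        chomp( my $text = $line );
        if ( $text =~ /^(>.*?)\[(\d+):(\d+)\](.*)$/ ) {
            ( $key, $start ) = ( $1, $2 );
            # Remember samples in order of first appearance.
            push @order, $key unless exists $frag{$key};
            $frag{$key}{$start} = {
                range  => sprintf( '[%s:%s]', $2, $3 ),
                suffix => $4,
                seq    => '',
            };
        }
        elsif ( defined $key ) {
            # Sequence line: append to the current fragment.
            $frag{$key}{$start}{seq} .= "$text\n";
        }
    }
    my $out = '';
    for my $k (@order) {
        # Order the fragments of a sample by their start coordinate.
        my @starts = sort { $a <=> $b } keys %{ $frag{$k} };
        my $ranges = join '', map { $frag{$k}{$_}{range} } @starts;
        $out .= $k . $ranges . $frag{$k}{ $starts[0] }{suffix} . "\n";
        $out .= join '', map { $frag{$k}{$_}{seq} } @starts;
    }
    return $out;
}

my @input = (
    ">IGS-MLW-089sA_assembled_PF3D7_1475500.[1:61].sp.tr\n",
    "MHHLLFIIWYIILNYYVSGQESATNFYKFIDSFASSTYISEESGSSAYDAKRAIQNNPNY\n",
    ">IGS-MLW-089sA_assembled_PF3D7_1475500.[65:126].sp.tr\n",
    "TYDEELKESKEKANDLNNKLSLLTSVNVNTLDSDILKLGILPGDSYNFPANDCAVIKNVQ\n",
);
print merge_fasta(@input);
```

To run it on real files, the `@input` list could be replaced by the lines read from `<>`. Because all fragments are held in a hash until the end, this trades the streaming behaviour of an adjacent-records approach for independence from record order.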