I've been working on this problem for a while now.
I have a file containing 100 FASTA sequences, like the following:
>gi|192567|gb|AAA37417.1| cystic fibrosis transmembrane conductance regulator [Mus musculus]
MQKSPLEKASFISKLFFSWTTPILRKGYRHHLELSDIYQAPSADSADHLSEKLEREWDREQASKKNPQLIHALRRCFFWRFLFYGILLYLGEVTKAVQPVLLGRIIASYDPENKVERSIAIYLGIGLCLLFIVRTLLLHPAIFGLHRIGMQMRTAMFSLIYKKTLKLSSRVLDKISIGQLVSLLSNNLNKFDEGLALAHFIWIAPLQVTLLMGLLWDLLQFSAFCGLGLLIILVIFQAILGKMMVKYRDQRAAKINERLVITSEIIDNIYSVKAYCWESAMEKMIENLREVELKMTRKAAYMRFFTSSAFFFSGFFVVFLSVLPYTVINGIVLRKIFTTISFCIVLRMSVTRQFPTAVQIWYDSFGMIRKIQDFLQKQEYKVLEYNLMTTGIIMENVTAFWEEGFGELLQKAQQSNGDRKHSSDENNVSFSHLCLVGNPVLKNINLNIEKGEMLAITGSTGLGKTSLLMLILGELEASEGIIKHSGRVSFCSQFSWIMPGTIKENIIFGVSYDEYRYKSVVKACQLQQDITKFAEQDNTVLGEGGVTLSGGQRARISLARAVYKDADLYLLDSPFGYLDVFTEEQVFESCVCKLMANKTRILVTSKMEHLRKADKILILHQGTSYFYGTFSELQSLRPSFSSKLMGYDTFDQFTEERRSSILTETLRRFSVDDSSAPWSKPKQSFRQTGEVGEKRKNSILNSFSSVRKISIVQKTPLCIDGESDDLQEKRLSLVPDSEQGEAALPRSNMIATGPTFPGRRRQSVLDLMTFTPNSGSSNLQRTRTSIRKISLVPQISLNEVDVYSRRLSQDSTLNITEEINEEDLKECFLDDVIKIPPVTTWNTYLRYFTLHKGLLLVLIWCVLVFLVEVAASLFVLWLLKNNPVNSGNNGTKISNSSYVVIITSTSFYYIFYIYVGVADTLLALSLFRGLPLVHTLITASKILHRKMLHSILHAPMSTISKLKAGGILNRFSKDIAILDDFLPLTIFDFIQLVFIVIGAI
IVVSALQPYIFLATVPGLVVFILLRAYFLHTAQQLKQLESEGRSPIFTHLVTSLKGLWTLRAFRRQTYFETLFHKALNLHTANWFMYLATLRWFQMRIDMIFVLFFIVVTFISILTTGEGEGTAGIILTLAMNIMSTLQWAVNSSIDTDSLMRSVSRVFKFIDIQTEESMYTQIIKELPREGSSDVLVIKNEHVKKSDIWPSGGEMVVKDLTVKYMDDGNAVLENISFSISPGQRVGLLGRTGSGKSTLLSAFLRMLNIKGDIEIDGVSWNSVTLQEWRKAFGVITQKVFIFSGTFRQNLDPNGKWKDEEIWKVADEVGLKSVIEQFPGQLNFTLVDGGYVLSHGHKQLMCLARSVLSKAKIILLDEPSAHLDPITYQVIRRVLKQAFAGCTVILCEHRIEAMLDCQRFLVIEESNVWQYDSLQALLSEKSIFQQAISSSEKMRFFQGRHSSKHKPRTQITALKEETEEEVQETRL
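I've written a subroutine that opens the file and reads one sequence at a time. For each sequence, I want to store the gi number from the header and the long uppercase sequence as strings in a growing array. However, I'm having trouble writing a regex that captures these values. Here is my current subroutine, which I tweaked to check whether I'm actually capturing the gi number: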
sub getFASTA {
    my ($filename) = @_;
    my @FASTA_arr;
    $/ = "\n\n";
    open (my $fh, '<', $filename) or
        die ("Could not open file: $filename");
    while (<$fh>) {
        chomp $_;
        $_ =~ /^>gi|(\d*?)|/s;
        say "$1";
    }
    close $fh;
    #say join(" ", @FASTA_arr);
}
However, trying to run it returns:
Use of uninitialized value $1 in string at sequenceAlignment.pl line 30, <$fh> chunk 1.
This is printed once per sequence, 100 times in total.
So, any idea what's wrong? I'm almost certain the regex is the problem, because when I change it to "$_ =~ /(>gi|)/s;" it works fine and simply prints ">gi|" 100 times.
Answer 0 (score: 0)
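| means OR (alternation) in a regular expression. Escape it as \|. As written, /^>gi|(\d*?)|/ is read as three alternatives: ^>gi, (\d*?), or the empty string. The first alternative matches, so the capture group never takes part in the match and $1 stays undefined. (Your /(>gi|)/ test still matched because an alternation with an empty second branch is legal inside the group; Perl was not guessing what you "really" meant.)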
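To make the fix concrete, here is a minimal sketch of how the escaped pattern could be used to collect the gi number and the sequence. The names parse_record and get_fasta are hypothetical helpers, not code from the question. The sample data is made up, and the sketch assumes records are separated by blank lines, as the question's $/ = "\n\n" implies.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: split one FASTA record into (gi number, sequence).
# Returns an empty list when the header is not in the ">gi|NUMBER|" form.
sub parse_record {
    my ($record) = @_;
    my ($header, @lines) = split /\n/, $record;
    # \| matches a literal pipe; \d+ requires at least one digit.
    return unless defined $header && $header =~ /^>gi\|(\d+)\|/;
    my $gi  = $1;
    my $seq = join '', @lines;
    $seq =~ s/\s+//g;    # drop any stray whitespace inside the sequence
    return ($gi, $seq);
}

# Hypothetical reading loop: takes an open filehandle and returns a list of
# [gi, sequence] pairs. Assumes blank-line-separated records.
sub get_fasta {
    my ($fh) = @_;
    local $/ = "\n\n";   # localized so the change does not leak to other reads
    my @records;
    while (my $record = <$fh>) {
        chomp $record;
        my ($gi, $seq) = parse_record($record) or next;   # skip bad headers
        push @records, [$gi, $seq];
    }
    return @records;
}

# Usage with made-up sample data read through an in-memory filehandle.
my $data = ">gi|192567|gb|AAA37417.1| example one\nMQKSP\nLEKAS\n\n"
         . ">gi|42|gb|X| example two\nACDEF\n";
open(my $fh, '<', \$data) or die "Could not open in-memory handle: $!";
for my $r (get_fasta($fh)) {
    say "$r->[0] $r->[1]";
}
close $fh;
```

Checking the match before using $1 avoids the "uninitialized value" warning when a header has an unexpected format, and local keeps the changed record separator from affecting other file reads in the script.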