Trouble capturing and storing strings with a Perl regular expression?

Time: 2016-10-21 20:08:30

Tags: regex perl

So I've been working on this problem for a while now.

I have a file containing 100 FASTA sequences, each of which looks like this:

>gi|192567|gb|AAA37417.1| cystic fibrosis transmembrane conductance regulator [Mus musculus]
MQKSPLEKASFISKLFFSWTTPILRKGYRHHLELSDIYQAPSADSADHLSEKLEREWDREQASKKNPQLIHALRRCFFWRFLFYGILLYLGEVTKAVQPVLLGRIIASYDPENKVERSIAIYLGIGLCLLFIVRTLLLHPAIFGLHRIGMQMRTAMFSLIYKKTLKLSSRVLDKISIGQLVSLLSNNLNKFDEGLALAHFIWIAPLQVTLLMGLLWDLLQFSAFCGLGLLIILVIFQAILGKMMVKYRDQRAAKINERLVITSEIIDNIYSVKAYCWESAMEKMIENLREVELKMTRKAAYMRFFTSSAFFFSGFFVVFLSVLPYTVINGIVLRKIFTTISFCIVLRMSVTRQFPTAVQIWYDSFGMIRKIQDFLQKQEYKVLEYNLMTTGIIMENVTAFWEEGFGELLQKAQQSNGDRKHSSDENNVSFSHLCLVGNPVLKNINLNIEKGEMLAITGSTGLGKTSLLMLILGELEASEGIIKHSGRVSFCSQFSWIMPGTIKENIIFGVSYDEYRYKSVVKACQLQQDITKFAEQDNTVLGEGGVTLSGGQRARISLARAVYKDADLYLLDSPFGYLDVFTEEQVFESCVCKLMANKTRILVTSKMEHLRKADKILILHQGTSYFYGTFSELQSLRPSFSSKLMGYDTFDQFTEERRSSILTETLRRFSVDDSSAPWSKPKQSFRQTGEVGEKRKNSILNSFSSVRKISIVQKTPLCIDGESDDLQEKRLSLVPDSEQGEAALPRSNMIATGPTFPGRRRQSVLDLMTFTPNSGSSNLQRTRTSIRKISLVPQISLNEVDVYSRRLSQDSTLNITEEINEEDLKECFLDDVIKIPPVTTWNTYLRYFTLHKGLLLVLIWCVLVFLVEVAASLFVLWLLKNNPVNSGNNGTKISNSSYVVIITSTSFYYIFYIYVGVADTLLALSLFRGLPLVHTLITASKILHRKMLHSILHAPMSTISKLKAGGILNRFSKDIAILDDFLPLTIFDFIQLVFIVIGAIIVVSALQPYIFLATVPGLVVFILLRAYFLHTAQQLKQLESEGRSPIFTHLVTSLKGLWTLRAFRRQTYFETLFHKALNLHTANWFMYLATLRWFQMRIDMIFVLFFIVVTFISILTTGEGEGTAGIILTLAMNIMSTLQWAVNSSIDTDSLMRSVSRVFKFIDIQTEESMYTQIIKELPREGSSDVLVIKNEHVKKSDIWPSGGEMVVKDLTVKYMDDGNAVLENISFSISPGQRVGLLGRTGSGKSTLLSAFLRMLNIKGDIEIDGVSWNSVTLQEWRKAFGVITQKVFIFSGTFRQNLDPNGKWKDEEIWKVADEVGLKSVIEQFPGQLNFTLVDGGYVLSHGHKQLMCLARSVLSKAKIILLDEPSAHLDPITYQVIRRVLKQAFAGCTVILCEHRIEAMLDCQRFLVIEESNVWQYDSLQALLSEKSIFQQAISSSEKMRFFQGRHSSKHKPRTQITALKEETEEEVQETRL

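I have written a subroutine that opens the file and reads one sequence at a time. For each sequence, I want to take the gi number at the start and the long uppercase sequence and add them, as strings, to a growing array. However, I'm having trouble writing a regex that stores these values. Here is my current subroutine, which I've tweaked to check whether I'm actually capturing the gi number: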

sub getFASTA {
    my ($filename) = @_;
    my @FASTA_arr;
    $/ = "\n\n";
    open (my $fh, '<', $filename) or
            die ("Could not open file: $filename");
    while (<$fh>) {
            chomp $_;
            $_ =~ /^>gi|(\d*?)|/s;
            say "$1";
    }
    close $fh;
    #say join(" ", @FASTA_arr);
}

However, trying to run it gives:

Use of uninitialized value $1 in string at sequenceAlignment.pl line 30, <$fh> chunk 1.

This is printed once per sequence, 100 times in total.

So, any ideas what's wrong? I'm almost certain the problem is the regex, because when I change it to "$_ =~ /(>gi|)/s;" it works fine and simply prints ">gi|" 100 times.

1 Answer:

Answer 0: (score: 0)

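| means OR (alternation) in a regex. Escape it as \| to match a literal pipe. (In your second pattern, it looks like perl worked out what you "really" meant: the | sits at the end of the capture group with no second operand.)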
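To illustrate the point about escaping, here is a minimal sketch, not the asker's final code. The helper name parse_fasta_record and the shortened sample record are made up for this example, and it assumes every header starts with ">gi|" followed by digits, with the sequence on the lines after the header.

```perl
use strict;
use warnings;

# Unescaped, /^>gi|(\d*?)|/ is three alternatives: '^>gi', '(\d*?)' and the
# empty pattern. The first alternative matches, so the capture group never
# takes part in the match and $1 is left undefined.
my $header = ">gi|192567|gb|AAA37417.1|";
if ($header =~ /^>gi|(\d*?)|/) {
    print defined $1 ? "captured $1\n" : "matched, but \$1 is undefined\n";
}

# Hypothetical helper (not from the original post): pull the gi number and
# the residue string out of one FASTA record held in a string.
sub parse_fasta_record {
    my ($record) = @_;

    # \| matches a literal pipe. (\d+) requires at least one digit.
    # [^\n]*\n skips the rest of the header line; (.*) with /s grabs
    # everything after it, newlines included.
    if ($record =~ /^>gi\|(\d+)\|[^\n]*\n(.*)/s) {
        my ($gi, $seq) = ($1, $2);
        $seq =~ s/\s+//g;    # join a sequence wrapped over several lines
        return ($gi, $seq);
    }
    return;    # empty list if the record has no gi header
}

# Shortened, made-up record in the same shape as the one in the question.
my $record = ">gi|192567|gb|AAA37417.1| example protein [Mus musculus]\n"
           . "MQKSPLEK\nASFISKLF\n";
my ($gi, $seq) = parse_fasta_record($record);
print "$gi $seq\n";
```

Inside the while loop of getFASTA, the same pattern could be applied to $_ and the two captured values pushed onto @FASTA_arr. Checking that the match succeeded before using $1 avoids printing a stale or undefined capture.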