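I am a Perl newbie, stuck on yet another bioinformatics problem where I need some help and input.
Problem in brief:
I have a file with over 40,000 unique DNA sequences. By unique, I mean unique sequence IDs. I have attached a portion of it at the end of the post to show what it looks like.
I need to split each of the 40,000 sequences into 3 parts. So if a particular sequence is 999 characters long, each of the 3 parts will have 333 characters.
I need to search the 3 individual parts for the following pattern: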
$gpat = '[G]{3,5}';
$npat = '[A-Z]{1,25}';
$pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
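If $pattern occurs in the first of the 3 parts, increment the counter for 'start'; if $pattern occurs in the 2nd of the 3 parts, increment the counter for 'middle'; and finally, if $pattern occurs in the 3rd part, increment the counter for 'end'.
Print the counters for 'start', 'middle' and 'end', i.e. basically the sums of 'start', 'middle' and 'end' over all sequences.
Say in the first sequence the values are '2', '5', '3' and in the second sequence they are '4', '1', '6'; the final counts should then be '6, 6, 9'.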
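Summing the counters position by position can be sketched as follows. The two triples are the hypothetical per-sequence results from the example above, and the variable names are only illustrative.

```perl
use strict;
use warnings;

# Hypothetical per-sequence counts in the order (start, middle, end)
my @per_sequence = ( [2, 5, 3], [4, 1, 6] );

# Running totals over all sequences
my @totals = (0, 0, 0);
for my $counts (@per_sequence) {
    $totals[$_] += $counts->[$_] for 0 .. 2;
}

print join(', ', @totals), "\n";   # prints "6, 6, 9"
```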
Problems I am facing:
1. Take this sequence:
gggatgtcgatgcatggggatgcatcgatgcggggactagctagcgggatgctacgatggggatgatgataatatcgcggcgcatatatgctagtctatatatta
Splitting it into 3 parts gives the following 3 subparts, each 35 characters long:
gggatgtcgatgcatggggatgcatcgatgcgggg
actagctagcgggatgctacgatggggatgatgat
aatatcgcggcgcatatatgctagtctatatatta
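Thus, $pattern is split across the first two parts. Is there any way to say "if $pattern starts in part 1 and ends in part 2, increment the count for 'start'"?
##UPDATE## The following issue has been resolved thanks to the code suggested by Cupidvogel.
2. If the sequence length is not divisible by 3, how can I split the sequence into 3 parts? I tried using int, but the last part came out 1-2 characters short.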
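One possible way to split a string into three parts that always cover every character is to compute the boundaries first and cut with substr. This is only a sketch; split_in_three is a made-up helper name, and the test string is not taken from the real data.

```perl
use strict;
use warnings;

# Split a string into three contiguous parts that together cover every character.
# The boundaries are 0, len/3, 2*len/3 and len, each rounded down.
sub split_in_three {
    my ($seq) = @_;
    my $len    = length $seq;
    my @bounds = map { int($len * $_ / 3) } 0 .. 3;
    return map { substr($seq, $bounds[$_], $bounds[$_ + 1] - $bounds[$_]) } 0 .. 2;
}

my @parts = split_in_three('ACGTACGTAC');            # 10 characters
print join(' ', map { length $_ } @parts), "\n";     # prints "3 3 4"
```

Because the last boundary is the full length, the third part absorbs any remainder instead of losing it.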
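Below is the code I have written so far.
It reads in the file, shows the header name and the sequence, the length each part of the sequence will have, and finally the sequence split into 3 parts. This works fine if the sequence length is divisible by 3. If it isn't, the final 3rd part is 1-2 characters short.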
#Take Filename from user
print "Please enter file name : ";
$in =<>;
chomp $in;
open (FASTA,"$in") or die ;
while (<FASTA>)
{
    $/=">";
    @array = split '\n', $_;
    $header=shift @array; # Header of the fasta sequence
    print "\n\nNext sequence: \n";
    print $header,"\n";
    $seq= join '', @array; # sequence
    $seq=~s/\s//g;
    $seq=~s/\*//g;
    $seq=~s/>//g;
    print $seq,"\n\n";
    $num = int(length($seq)/3);
    @arr = unpack("A$num A$num A*",$seq);
    print " New method gives this :", @arr;
    print "\nThe first element is :", $arr[0];
    print "\nThe second element is :",$arr[1];
    print "\nThe third element is :",$arr[2] ;
    #The following lines of code were originally written to split...
    #...the sequence into 3 parts, albeit unsuccessfully
    #my $split = (length $seq)/3;
    #print $split,"\n\n";
    #my $int = int $split;
    #print $int,"\n\n";
    #my @array2 = $seq =~ /(.{$int})/g;
    #print join (" ", @array2),"\n\n";
    #print $array2[0],"\n",$array2[1],"\n",$array2[2];
}
exit;
So far I have been testing the code with the following sample file: sample.fa
>ABC_123 2
atgtcgatcgatcggcgggcatgcgcgcgcggatg
atatatagcgcgcgctatatagcgcgactctacgc
atgctgctgactagctatagtcgctgactgcgcgt
gggaaaaagggcccgggccccgttttggggatcta
ggggatagctgatgctagcatgcatgctgactgca
>DEF_456 4
gggatgtcgatgcatggggatgcatcgatgcgggg
actagctagcgggatgctacgatggggatgatgat
aatatcgcggcgcatatatgctagtctatatatta
>GHI_789 1
atagctgctagtcgatcggcgcgggtatcgatcgg
ggatcgatcgatcggggatcgatcgggggatcgat
The actual input file looks like this:
>NR_037701 1
aggagctatgaatattaatgaaagtggtcctgatgcatgcatattaaaca
tgcatcttacatatgacacatgttcaccttggggtggagacttaatattt
aaatattgcaatcaggccctatacatcaaaaggtctattcaggacatgaa
ggcactcaagtatgcaatctctgtaaacccgctagaaccagtcatggtcg
gtgggctccttaccaggagaaaattaccgaaatcactcttgtccaatcaa
agctgtagttatggctggtggagttcagttagtcagcatctggtggagct
gcaagtgttttagtattgtttatttagaggccagtgcttatttagctgct
agagaaaaggaaaacttgtggcagttagaacatagtttattcttttaagt
gtagggctgcatgacttaacccttgtttggcatggccttaggtcctgttt
gtaatttggtatcttgttgccacaaagagtgtgtttggtcagtcttatga
cctctattttgacattaatgctggttggttgtgtctaaaccataaaaggg
aggggagtataatgaggtgtgtctgacctcttgtcctgtcatggctggga
actcagtttctaaggtttttctggggtcctctttgccaagagcgtttcta
ttcagttggtggaggggacttaggattttatttttagtttgcagccaggg
tcagtacatttcagtcacccccgcccagccctcctgatcctcctgtcatt
cctcacatcctgtcattgtcagagattttacagatatagagctgaatcat
ttcctgccatctcttttaacacacaggcctcccagatctttctaacccag
gacctacttggaaaggcatgctgggtctcttccacagactttaagctctc
cctacaccagaatttaggtgagtgctttgaggacatgaagctattcctcc
caccaccagtagccttgggctggcccacgccaactgtggagctggagcgg
gagggaggagtacagacatggaattttaattctgtaatccagggcttcag
ttatgtacaacatccatgccatttgatgattccaccactccttttccatc
tcccagaagcctgctttttaatgcccgcttaatattatcagagccgagcc
tggaatcaaactgcctctttcaaaacctgccactatatcctggctttgtg
acctcagccaagttgcttgactattctcagtctcagtttctgcacctgtc
aaatagggtttatgttaacctaactttcagggctgtcaggattaaatgag
catgaaccacataaaatgtttggtgtatagtaagtgtacagtaaatactt
ccattatcagtccctgcaattctatttttcttccttctctacacagcccc
tgtctggctttaaaatgtcctgccctgctttttatgagtggataccccca
gccctatgtggattagcaagttaagtaatgacactcagagacagttccat
ctttgtccataacttgctctgtgatccagtgtgcatcactcaaacagact
atctcttttctcctacaaaacagacagctgcctctcagataatgttgggg
gcataggaggaatgggaagcccgctaagagaacagaagtcaaaaacagtt
gggttctagatgggaggaggtgtgcgtgcacatgtatgtttgtgtttcag
gtcttggaatctcagcaggtcagtcacattgcagtgtgtcgcttcacctg
gctccctcttttaaagattttccttccctctttccaactccctgggtcct
ggatcctccaacagtgtcagggttagatgccttttatgggccacttgcat
tagtgtcctgatagaggcttaatcactgctcagaaactgccttctgccca
ctggcaaagggaggcaggggaaatacatgattctaattaatggtccaggc
agagaggacactcagaatttcaggactgaagagtatacatgtgtgtgatg
gtaaatgggcaaaaatcatcccttggcttctcatgcataatgcatgggca
cacagactcaaaccctctctcacacacatacacatatacattgttattcc
acacacaaggcataatcccagtgtccagtgcacatgcatacacgcacaca
ttcccttcctaggccactgtattgctttcctagggcatcttcttataaga
caccagtcgtataaggagcccaccccactcatctgagcttatcaaccaat
tacattaggaaagactgtatttcctagtaaggtcacattcagtagtactg
agggttgggacttcaacacagctttttgggggatcataattcaacccatg
acagccactgagattattatatctccagagaataaatgtgtggagttaaa
aggaagatacatgtggtacaaggggtggtaaggcaagggtaaaaggggag
ggaggggattgaactagacacagacacatgagcaggactttggggagtgt
gttttatatctgtcagatgcctagaacagcacctgaaatatgggactcaa
tcattttagtccccttctttctataagtgtgtgtgtgcggatatgtgtgc
tagatgttcttgctgtgttaggaggtgataaacatttgtccatgttatat
aggtggaaagggtcagactactaaattgtgaagacatcatctgtctgcat
ttattgagaatgtgaatatgaaacaagctgcaagtattctataaatgttc
actgttattagatattgtatgtctttgtgtccttttattcatgaattctt
gcacattatgaagaaagagtccatgtggtcagtgtcttacccggtgtagg
gtaaatgcacctgatagcaataacttaagcacacctttataatgacccta
tatggcagatgctcctgaatgtgtgtttcgagctagaaaatccgggagtg
gccaatcggagattcgtttcttatctataatagacatctgagcccctggc
ccatcccatgaaacccaggctgtagagaggattgaggccttaagttttgg
gttaaatgacagttgccaggtgtcgctcattagggaaaggggttaagtga
aaatgctgtataaactgcatgatgtttgcaggcagttgtggttttcctgc
ccagcctgccaccaccgggccatgcggatatgttgtccagcccaacacca
caggaccatttctgtatgtaagacaattctatccagcccgccacctctgg
actccctcccctgtatgtaagccctcaataaaaccccacgtctcttttgc
tggcaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaa
>NM_198399 1
aacagattttaactctgaaaagccatttccagtgtctatagactattgtg
agcctggagaagtagcatttagttgggatagcttcactagagctgcctgc
caaagacttccttccacaggatcttgtcgcaccagcaactgacaggagct
tgggagctcgggagcttgggagagggcttatgtttttaataatgtagctg
tcagttcgaagcctggaaatgttgaccctcaaagggcataaaatcttgtt
attttaatttgcatctgggagaatgtctgagcaaggagacctgaatcagg
caatagcagaggaaggagggactgagcaggagacggccactccagagaac
ggcattgttaaatcagaaagtctggatgaagaggagaaactggaactgca
gaggcggctggaggctcagaatcaagaaagaagaaaatccaagtcaggag
caggaaaaggtaaactgactcgcagccttgctgtctgtgaggaatcttct
gccagaccaggaggtgaaagtcttcaggatcagactctctgaaaactgca
aatggaaaggaattcaaaagaatttagattaaaagttaaataaaaagtag
gcacagtagtgctgaattttcctcaaaggctctcttttgataaggctgaa
ccaaatataatcccaagtatcctctctccttccttgttggagatgtctta
cctctcagctccccaaaatgcacttgcctataagaaacacaattgctggt
tcatatgaaacttaggaaatagtgaataaggtgcatttaactttggagaa
atacttttatggctttggtggagatttctcaatactgcaaaagttgtcca
gaaatgaatctgagctgatggtgactttaagttaatattattaatatatc
actgcatatttttacccttatttttgctccttacagcaagattagtaggt
tataaaaatttaaatttaaacaaaattatttcatgacaaaatgggaaact
tcacatcatacttatttttgtttgcctttcaggcatcatattagctttta
taaaaaatggtcttgctgctgaaattgtacttattttatcagaggctggg
tgcagtcaagacaaaagtaaaatggtttacctgagcccaggggagggaaa
attgattaagatatcattatttttgtttggtttggttttgcttttttcct
cttactttaattgaaatactctgaattcccctcatggaaacagagagcat
tgagagcactttctttaaaaggaccaaaaataaattcctaatagattttg
tcctaagagagtgtttttttttctagcatcattttctttacatgccactc
atgtcataaggcatggacaggctatctttcagtggccattactatgtttc
gtacacatgctttattttacttgggctctgagaaatgtgtggctttcctt
cagcattttatttgtgcttctctttttaatggagattgaaaagggagaat
aatgtgaatatcacggcttatattattaaatgttgattgatggcttgtaa
tgtactgcacacaatatatgttaactctgcagaatgacagaccctgggag
aagtaatgccccagttgtcccccactcctaatgccaggcagagaaggaca
gcctttatagacttaatctgctttttgtcccatttgacaaggtaccagga
ggaaattttttaagggatcaactgtatcacagtgcccactctggacctaa
gtctagtgtatccatacaattggtgcagagaaataaggtgtaaatggtgc
tttgttcctgctggttccaagctcagaaaccaagactagctttgtaggag
agaatgagagcctgcaagcctctctttggattggctgaggagtggtggga
gcagggggttgatagaaaacatccagacacacatataagcaagtggccgt
gctacctttttagagaataaagaaacagacttttgagtttatatgcaatg
ccttcattaggtaccaccggcacttacaaaatgtgcggactgaatcccag
agaacactggcagatgtatacagtatatggattgtatcgcttccccaatg
tttgtaaattcacagtatttggaaaactgccttcattttccagtgtggga
aaaactcttgctacctgtattacttgatctcagacccatacctgatggtt
cagtctgtccttaagttaaaagaattttgcttttctaatgttatactatt
tacctgtcagtgtattactgcaacttgaatcactcttttactgttgttgg
atataaacttatcctgtaccaatgtatttattaacacttgtattttatta
ttgagcatatcaataaaaatattaaaaaataacagattgttttttaccaa
aaaaaaaaaaaaa
>NR_026816 1
caacccactctctgtgctatgacttcattactctttcccagcccagccct
gggcaagccccttacgaagtctcaggctacctggatgaccaccctttctt
atgatgctgcaaggagggcaggtgggcagagccccgtgcatcctgggctc
aggccagggacccaagagcttgggagaagctggttctcagactgaaggcc
agagcccagcaccttgtcaccatcccggggagcatcatggcacacaacaa
ccagagccaaggctacagctagagagttgactcctctatttgagattgac
aggcctcggaagtcaaaataagtggtttcctagaccgggtcgagagcaag
tctctattggtcccaactgagttttttcagctggtttttcaaccaaacag
cacctcatctcccagtgaggggaagggaaggctgggctgagagcagcaag
gctgctcatctcacctctccccacccagccatgccagccgcctcacctgg
tggggagaggtgggcctcacctgggtcccctggcagtgctctgtgaaggg
tcttgacattgcactgtaataataaaggtgtgtgtgaagtatcaaaaaaa
>NR_027917 1
atgaagatgattgagcagcacaatcaggaatacagggaagggaaacacag
cttcacaatggccatgaacgcctttggagaaatgaccagtgaagaattca
ggcaggtggtgaatggctttcaaaaccagaagcacaggaaggggaaagtg
ctccaggaacctctgcttcatgacatccgcaaatctgtggattggagaga
gaaaggctacgtgactcctgtgaaggatcagtgcagctggggctctgtaa
ggacagatgttaggaaaactgagaaactagtttcactgagtgtgcagacc
tggtggactgctctaggcttcaaggcaatgttggctgcatttttggagaa
ccattattttgcttccagtatgttgccgacaatggaggcctggactctga
ggaatccttttcatatgaagaaaagctctggagactggaaagtccaaggt
cacagaggtgcatctggtgagagccttcttgctagtggggaatctcagca
gagtcctgaggtggcacagtattctgggaagcatcaagtgcagtgtcatc
ttatcgaggaggctctgcagatgctaagtggtggggatgaggatcacgat
gaagacaaatggccccatgacatgaggaatcatctggctggagaggccca
ggtgtag
>NR_002777 3
cttgtcctttcagaagatcagagacaagtgatatctgtgccaatttggcc
ttttcagtgttataattatggtgtcttgggatcccaatatttctcctaat
gtttccctgatgtgatactttgagagcccaggatgccagtacaataattg
aaattcacaaatgtctggtatcttgtccctcgtgccccatatattatctg
tggtttcggagagctcacttgtctcttatcttcagaaatgacagcacatg
aaatgttgtttggagccactgtcacatcaactgtagaaaaattaacaggt
cagctaagggatataatgtaactttatttgtgatatgagagaaatcttga
taaagacttgagagaaaactgggaggaaccttgtttagaagttataagga
ggggtaagttatgtgtgtcttggaaggagaatcataaatcttaaaacatg
agcctaatagagaacataaaattctaaaagataaagataataataatgat
aagccgcagggtggcttatgataatgtgacttctccttaccccagtagcg
tcggacatctgtcagctctgaaatgataaaaatgcacaatattgaataca
aacaaaggagtcagcactgaaattcattttctctccagattagggaaaga
gtaggtatgccctatggtagggcagtaaattgctgaatgatgagatgaaa
cagccacctagccatttcccattaaatataatcccatcagcagcagacaa
tatctatcctcccctatcccctctatccatatttggaaactgcaccctct
tccctatttagcaccctaacaccacttgaattccataaccctgttgttga
tctagctctcctcacctctaaacacttctagcattcctttcagatcagga
gctcgaaacactctcctttgattttttggaaaagtttctggcttcttcaa
ggtcacgttctccgtcctaagaattaaaaaaaaaaaaaaaaacttccaaa
cctttgaccttgtgtccgtggaacacccctgacttcctatcatttcaacc
cattgaggcacttgaactctcttcttggggatcctgagaagggagagtgc
aaactcttgaccctggaggcaaacaaaatgttctcatgtttgccttccca
cttactttctgtgagaacgtgggaagatcttaacctctcagaagcacagt
ttcttccttctaaaatgaaataattaacctctccctgtctacattcttaa
actcataggacataaaaaaaaaaaaaa
>NR_033769 1
ggcctctggcgggcctccagccagttagaccatttgactaggacgtgtgc
agctcagccagccacagaactggaatttttcaggagcagggggagcatgg
agtttggactttgctgagcaactgaagtggagcgcagagcttgctcgctt
aggagagggcagcatggatggcaaacaagggggcatggatgggagcaagc
ccacggggccaagagactctcctgacaccaggcttctttcaaacccattg
atgggtgattctgtgtctgattggtctcctatgcctgaagctgcaatcta
cggacatcagctgtctctgaggaacctcatcagccacgggtggcttgtga
acatcatcatggcagatcatgtttccccactccatgaagcctgtctcaga
ggtcatccctctcgtgtaaagattttattaaagcatggagctcaggtgaa
tggcgtgacaacagactggcacactccactgtttaatgtttgtatcagca
gcagctgggattatgcttctgcagcatggagccagcgttcaacctgagag
tgatctggcatcccccgtccatgaagctgctaggagaggccacgtggagt
gtgtcgactctcttacagcttataggggcaaaaatgaccataacatcagc
cacgtgggcacttcactgtatttggcttgtgaaaaccagcagatagcctg
tgtcaagaagcttctggagtcaggagcagacctgaacccagggagaggtt
ccccacttcatgcagtggccttcatgaaggccctcatgaaggattcccca
cttcatgcagtggccaggacagccagtgaagagctggcctgcctgctcat
ggattttggagcagacacccaggccaagaatgctgaaggcaaatgtcatg
tggagctggtgcctccagagagccctttgatccagctcttcttggagaga
gaagggcccccttcttttgatgcagttatgcctagaaatcagaagggctt
tggaatccagcagcatcataagataaccaaagtcgtcctcccagaggatc
tgaaatggtttctcctacatctttgtatgtatcaatggaatggattcaca
aacaatgtgaaaacattattgagtgttgtagccactagaattttaaaatc
aagttaggtttatagagtttgactagttttttcgattagatttgtattag
ttataaatttgttcatagagtttgactaattttttcgattagatttgtat
ttgttaaactctgaagccagagtttaaacacactgcatacgtttgtatga
ttagttagaaggcatgaagacttttttccctgcttggagactgtctaaaa
taacagctattgttttgcatatccactgcaggccaagcactttcagcatc
atctaattcagccctcacagcaactgggtcaatctgtccaatttcccagg
gcaaggatagaggagtcagattcaaatacaggttttctgacgttaactta
tgtgatgatttgatcaaagcaggattttccagcatcactatccttgttcc
atctctgctatatgggaatgaaaataaagaaatgtatttcaaaaaaataa
aaagaaaagaaaaacagagacggtc
>NM_016326 3
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
tgtgaaaggccacgtgaagatgctgcggctggtgtttgcacttgtgacag
cagtatgctgtcttgccgacggggcccttatttaccggaagcttctgttc
aatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagt
tttgtaattttatattactttttagtttgatactaagtattaaacatatt
tctgtattcttccacatattttctgcagttattttaactcagtataggag
ctagaggaagagatttccgaagtctgcaccccgcgcagagcactactgta
acttccaagggagcgctgggagcagcgggatcgggttttccggcacccgg
gcctgggtggcagggaagaatgtgccgggatccgcctcagggatctttga
atctctttactgcctggctggccggcagctccg
>NM_181641 2
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta
tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact
ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag
acttgatcgattaatgaagtggttattttggcctttgcttgtgtttgcac
ttgtgacagcagtatgctgtcttgccgacggggcccttatttaccggaag
cttctgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaa
aaaagaagttttgtaattttatattactttttagtttgatactaagtatt
aaacatatttctgtattcttccacatattttctgcagttattttaactca
gtataggagctagaggaagagatttccgaagtctgcaccccgcgcagagc
actactgtaacttccaagggagcgctgggagcagcgggatcgggttttcc
ggcacccgggcctgggtggcagggaagaatgtgccgggatccgcctcagg
gatctttgaatctctttactgcctggctggccggcagctccg
>NM_001144931 1
gtttccgttcctctgcccgccatgccgttcctagagctgcacacgaattt
ccccgccaaccgagtgcccgcggggctggagaaacggctgtgcgccgtcg
ctgcctccatcttgggcaaacctgcagaccttgtgaacgtgacggtacgg
ccgggcctggccagggcgctgagcgggtccaccgagccctgcgcgcagct
gtccatctcctccatcggcgtagtgggcaccgccgaggacaaccgcagcc
acagtgcccacttctttgagtttctcaccaaggagctagccctgggccag
gaccggtgcgcaggggtagtaggcccggaatattattctaaaacacaatc
agagtactccattcctgctaacagtttaaagccaaacacctaggcaggcc
atttaggcttctgaatgactgggtcttgaccaggagagctgctgtctagg
ttttctcttcctgaccagttcctcaagagaaatgcaaaactagtgattaa
cagtaagagtcaggcagggcgcggtggctcacgcctgtaatcccagcact
ttgggaggccgag
>NR_029429 1
ggacaccaccccaaaatttcctagtcctctttgatacgggttcctccaat
ctgtagctgccctccatctactgccagagccaagtctgctccaatcacaa
caggttcaatcccagcctgtcctccaccttcagaaacgatggacaaacct
atggactatcctatgggagtggcagcctgagtgtgttcctgggctatgac
actgtgactgttcataacatcgttgtcaataaccaggagtttggcctgag
tgagaatgagcccagcgaccccttttactattcagactttgacgggatcc
tgggaatggcctacccaaacatggcagaggggaattcccctacagtaatg
caggggatgctgcagcagagccagcttactcagcccgtcttcagcttcta
cttcacctgccagccaacccgccagtattgtggagagctcatccttggag
gtgtggaccccaactttattctggtcagatcatctggacccctgtcagcc
cgtaactgtactggcagattgccatcgaggaatttgccatcggtaaccag
gccactggcttgtgctctgagggttgccaggccattgtggataccgagac
cttcctgc
>NR_026551 1
tgtggcctgagaggacggccaggactggccagaaaagagagggacgtggc
taaacgtgagggggcgtggccaagatggccgcgtgcgggatcctcgggta
ccgggagcgaacgaggaggttctggctcagtgcatccactctgggagagc
gtggacctggttcctgggggcgatcgccagtcacccatcaacattcggtg
gagggacagtgtttatgatcccggcttaaaaccactgaccatctcttatg
acccagccacctgcctccacgtctggaataatgggtactctttcctcgtg
gaatttgaagattctacagataaatcagctgcacttagtgcattggaacg
cagtcaaatttgaaaactttgaggatgcagcactggaagaaaatggtttg
gctgtgataggagtatttttaaagatttcggaaacttctggcagcccagt
gtctactggaaggcccaagccgcttgccagaaagctgcgccccgcccaaa
agcactgggttctgcagtccaggcccttcctcagctcccaggtccaggag
aactgcaaggtcacctacttccacaggaagcactgggtccgcatccggcc
cctccgcaccactcctcccagctgggactacacccgcatctgcatccaga
gagagatggtccccgcccgcatccgcgtcctgagagagatggtccccgag
gcctggaggtgctttcccaacaggctgccgctgctgagcaacatcaggcc
tgatttctccaaggctcccctggcctacgtgaagcggtggctttggaccg
cccgccacccccacagcctgtccgcagcctggtgaccgtgaaaatcgccc
cgccagagagcagaggaagcccgacgcccaggccatctgccttcaggtct
gtgatgagaaacggagtggcctgttccgttgtgcccaggtctaggccgct
gagcagagccctcactcccaggcagagttgtctgaatccttcct
>NM_181640 2
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
tgtgaaaggccacgtgaagatgctgcggctggatattatcaactcactgg
taacaacagtattcatgctcatcgtatctgtgttggcactgataccagaa
accacaacattgacagttggtggaggggtgtttgcacttgtgacagcagt
atgctgtcttgccgacggggcccttatttaccggaagcttctgttcaatc
ccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaagaagttttg
taattttatattactttttagtttgatactaagtattaaacatatttctg
tattcttccacatattttctgcagttattttaactcagtataggagctag
aggaagagatttccgaagtctgcaccccgcgcagagcactactgtaactt
ccaagggagcgctgggagcagcgggatcgggttttccggcacccgggcct
gggtggcagggaagaatgtgccgggatccgcctcagggatctttgaatct
ctttactgcctggctggccggcagctccg
>NM_016951 3
atgcgcgcaagagagcgggaagccgagctgggcgagaagtaggggagggc
ggtgctccgccgcggtggcggttgctatcgcttcgcagaacctactcagg
cagccagctgagaagagttgagggaaagtgctgctgctgggtctgcagac
gcgatggataacgtgcagccgaaaataaaacatcgccccttctgcttcag
tgtgaaaggccacgtgaagatgctgcggctggcactaactgtgacatcta
tgaccttttttatcatcgcacaagcccctgaaccatatattgttatcact
ggatttgaagtcaccgttatcttatttttcatacttttatatgtactcag
acttgatcgattaatgaagtggttattttggcctttgcttgatattatca
actcactggtaacaacagtattcatgctcatcgtatctgtgttggcactg
ataccagaaaccacaacattgacagttggtggaggggtgtttgcacttgt
gacagcagtatgctgtcttgccgacggggcccttatttaccggaagcttc
tgttcaatcccagcggtccttaccagaaaaagcctgtgcatgaaaaaaaa
gaagttttgtaattttatattactttttagtttgatactaagtattaaac
atatttctgtattcttccacatattttctgcagttattttaactcagtat
aggagctagaggaagagatttccgaagtctgcaccccgcgcagagcacta
ctgtaacttccaagggagcgctgggagcagcgggatcgggttttccggca
cccgggcctgggtggcagggaagaatgtgccgggatccgcctcagggatc
tttgaatctctttactgcctggctggccggcagctccg
>NR_002773 1
cagcaccacaccaggaccctccagaggctgtgagaaacatcctgcaccca
ggtcctctctatctgtttatcattgtctattttgtattctgcattcagaa
ccaagagcctgaagacgacccaggagctttagctatggctgtcttcatta
ttttgtccctgtttagtgttctggtgacaggcatgggtgaaggtggggct
gggagtgagaaaggaggtgagagggaatgtaagctgaaccagcttcccca
ttgcccctccgtatctcccagtgcccagccttggacacaccctggccaga
gccagctgtttgcagacctgagccgagaggagctgacggctgtgatgcgc
tttctgacccagcagctggggccagggctggtggatgcagcccaggccca
gccctcggacaactgtgtcttctcagtggagttgcagctgcctcccaagg
ctgcagccctggctcacttggacagggggagccccccacctgcccgggag
gcactggccatcgtcttctttggcaggcaaccccagcccaacgtgagtga
gctggtggtggggccactgcctcacccctcctacatgcgggacgtgactg
tggagcgtcatggaggccccctgccctatcaccgacgccccatgttgttc
caagagtacctggacatagaccagatgatcttcgacagagagctgcccca
ggcttctgggcttctccatcactgttgcttctacaagcgccggggacgga
acctggtgacaatgaccacggctccccgtggtctgcaatcaggggaccgg
gccacctagtttggcctctactacaacatctcgggcgctgggttcttcct
gcaccacgtgggcttggagctgctagtgaaccacaaggcccttgaccctg
cccgctggactatccagaaggtgttctatcaaggccgctactatgacagc
ctggcccagctggaggcccagtttgaggccggcctggtgaatgtggtgct
gatcccagacaatggcacaggtgggtcctggtccctgaagtcccctgtgc
ccccgggtccagctccccctctgcagttccatccccaaggcccccgcttc
agtgtccagggaagtcgagtggcctcctcactgtggactttctcctttgg
cctcggagcattcagtggcccaaggatctttgacgttcccttccaagggg
agagggtggcctatgaagtcagtgtccaggcggccttggccatctatgga
ggcaattctccttctgctctacgaagccggtacatagatagtggctttgg
cttgggccacttctccacgcccctgacccatggggtggactgcccctacc
tggccacctacgtggactggcacttcctttttgagtcccaggccgccaag
acaatacgcgatgccttttgtatatttgaacagaaccagggcctccccct
gcggcgacaccactcagatctctactcccactactttgggggccttgcgg
aaacggtgctggtcatcagatctgtgtctactatgctcaactatgactat
gtgtgggatatggtcttccaccctaatggggccatagaaatcagactcca
caccaccggctacatcagctcagcattcccctttggtgctgcccagaggt
atggaaacaaagtttcagagcacaccctgggcacggtccacacccacagc
gcccacttcaaggtggacctggatgtagcaggtaaggcatcctggcagag
gcaaaagtgctggaggggtgagctgaagtctccatgcctagctttaaaag
ttttcgttgggctgggagcagtagcttatgcctgtaagcccaacactttg
ggagactgaggggggtggatcacttgaggtcaggagttcaaaaccagcct
ggccaacatggcgaaatcctgtctgtactaaaaatacaaaaattagctgg
gcatgggtatgctgtaatcctagctactcgggaggctgaggcaggagaat
cacttgaatctgggagtcagaggttgcagtgagctgagattgagccactg
cactccatcctgcgtgactgaac
>NR_037806 1
attcccagtcacccactcactcagaaagccgggagtcatcggacaccttg
ctggtcagaggtcctgggggtggttttgaaccatcagagcttggactttt
ctgacttccccagcaaggatcttcccacttcctgctccctgtgttcccac
cctccagtgttggcacaggcccacccctggctccaccagagccagaagca
gaggtagaatcaggcgggccccgggctgcactccgagcagtgttcctggc
catctttgctactttcctagagaacccggctgttgccttaaatgtgtgag
agggacttggccaaggcaaaagctggggagatgccagtgacaacatacag
ttcatgactaggtttaggaattgggcactgagaaaattctcaatatttca
gagagtccttcccttatttgggactcttaacacggtatcctcgctagttg
gttttaagggaaacactctgctcctgggtgtgagcagaggctctggtctt
gccctgtggtttgactctccttagaaccaccgcccaccagaaacataaag
gattaaaatcacactaataacccctggatggtcaatctgataataggatc
agatttacgtctaccctaattcttaacattgcagctttctctccatctgc
agattattcccagtctcccagtaacacgtttctacccagatcctttttca
tttccttaagttttgatctccgtcttcctgatgaagcaggcagagctcag
aggatcttggcatcacccaccaaagttagctgaaagcagggcactcctgg
ataaagcagcttcactcaactctggggaatgctaccattttttttccaaa
gtagaaaggaagcacttctgagccagtgaccactgaaagatgaacactct
tcctgatcctctcctctagaattcatctcctcctgctagcagccgcgtcc
tggaggagcagcggatggggaatccattctgtttcttcctggtgtttagg
aagttgccccacacacagattgccccgatgtccaaccagaagaagtgaaa
ctgctgctgggtctggagaggtgaagacccgtggccagcttctgttgttg
ccatcggccattgctttttgttcgcttgcttttggttttgcaagaagagc
ggcctctgtctctgatctgcttcaaatcatcattccatcagtgacagaag
tggctgttccatcagtggtcgcagccagttcagctcctgcatccatcccc
aagtgttctgagtggaatttgaggcctccccaaccacctaccaaaaaagg
agggtgaaatgaaaggaagaagaaaaactcagcattctttcctctgacaa
agagtaaaacgacaaggaatatcggcctgaattctcttcccaagaagaaa
gaaagcacaccaacgcaggcatttgtcttctgtccatggtgctgaagttt
attcactttcaaaccactttcagtaacagcaaattctttagaaaaggaaa
atacagggaaagggataaacctcactgacttggaggaaatcaagaggagt
gagcacagcatcagaaagccccctggccccagactgcacccgctttcctg
gccctaccttgaaatccatcaggtctgcgttggacacggcattgtacatg
ggattagctctg
Any help and input will be deeply appreciated.
Thank you for taking the time to look at my problem!
Answer 0 (score: 2)
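Rather than splitting the sequence into three parts, the way I see this working is to find all occurrences of $pattern in the whole sequence and determine which third each match starts in.
The built-in variable $-[0] contains the offset of the start of the most recent successful match.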
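As a small illustration of $-[0], the loop below collects the start offset of every match. The string and the simplified pattern are made up for the example and are not the question's data.

```perl
use strict;
use warnings;

# Made-up test string: runs of G start at offsets 2 and 7
my $text = 'aaGGGttGGGG';

my @starts;
while ($text =~ /G{3,5}/g) {
    push @starts, $-[0];   # offset at which the current match began
}

print "@starts\n";   # prints "2 7"
```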
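The following code does what I think you want. It works by accumulating each sequence (which ends when a new sequence ID is found or the end of the file is reached) and passing it to the process_seq subroutine.
The subroutine takes the length of the sequence and calculates the offsets at which each third of the string ends. The idiom sprintf '%.0f', $value is used to round fractional values to the nearest character position.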
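For a length that is not divisible by 3, the offset calculation behaves like this. The length of 100 is an arbitrary value chosen for illustration.

```perl
use strict;
use warnings;

# Arbitrary example length that is not a multiple of 3
my $length = 100;

# End offset of each third, rounded to the nearest whole character
my @offsets = map { sprintf '%.0f', $length * $_ / 3 } 1 .. 3;

print "@offsets\n";   # prints "33 67 100"
```

The last offset always equals the full length, so every match start falls before it.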
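The @counts array is updated for each occurrence of $regex in the sequence. The element of @counts to increment is established by comparing the start position of the match, in $-[0], with the end offset of each of the three segments of the sequence.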
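The comparison against the segment end offsets can be isolated in a small helper. This is a sketch only: third_for is a hypothetical name, and the offsets assume a 105-character sequence like the DEF_456 sample.

```perl
use strict;
use warnings;

# Return 0, 1 or 2 for the third of the sequence in which a match starts.
# @offsets holds the end offset of each third.
sub third_for {
    my ($place, @offsets) = @_;
    for my $i (0 .. $#offsets) {
        return $i if $place < $offsets[$i];
    }
    return $#offsets;   # fallback; not reached for valid start positions
}

my @offsets   = (35, 70, 105);              # assumed 105-character sequence
my @positions = (0, 34, 35, 69, 70, 104);   # sample match start offsets
my @thirds    = map { third_for($_, @offsets) } @positions;

print "@thirds\n";   # prints "0 0 1 1 2 2"
```

A match that starts at offset 34 and runs past offset 35 is still counted in the first third, because only its start position is tested.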
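After each sequence has been processed, the values in @counts are accumulated into @totals to give overall figures for all sequences.
The output of the program when run on the sample data is shown below. The totals are (9, 1, 6).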
use strict;
use warnings;

my $gpat = '[G]{3,5}';
my $npat = '[A-Z]{1,25}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i;

open my $fh, '<', 'sequences.txt' or die $!;

my ($id, $seq);
my @totals = (0, 0, 0);

while (<$fh>) {
    chomp;
    if (/^>(\w+)/) {
        process_seq($seq) if $id;
        $id = $1;
        $seq = '';
        print "$id\n";
    }
    elsif ($id) {
        $seq .= $_;
        process_seq($seq) if eof;
    }
}

print "Total: @totals\n";

sub process_seq {
    my $sequence = shift;
    my $length = length $sequence;
    my @offsets = map {sprintf '%.0f', $length * $_ / 3} 1..3;
    my @counts = (0, 0, 0);
    while ($sequence =~ /$regex/g) {
        my $place = $-[0];
        for my $i (0..2) {
            next if $place >= $offsets[$i];
            $counts[$i]++;
            last;
        }
    }
    print "@counts\n\n";
    $totals[$_] += $counts[$_] for 0..2;
}
Output
NR_037701
0 0 1
NM_198399
1 0 0
NR_026816
1 0 1
NR_027917
0 0 0
NR_002777
0 0 0
NR_033769
1 0 0
NM_016326
1 0 1
NM_181641
1 0 1
NM_001144931
0 0 0
NR_029429
0 1 0
NR_026551
1 0 0
NM_181640
1 0 1
NM_016951
1 0 1
NR_002773
1 0 0
NR_037806
0 0 0
Total: 9 1 6
Answer 1 (score: 2)
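I lifted Borodin's process_seq function, but used Bio::SeqIO to read the file sequence by sequence, which has advantages over reading the file line by line by hand and working out the logic for the various cases yourself. I believe these advantages are:
... (next_seq) method to read the resulting file.
I think the BioPerl package of bioinformatics modules must be overwhelming for a biologist who is just starting to program. He may be reluctant to try digging out the information needed to start building a program. The BioPerl wiki is a good place to start, especially the HOWTOs section, which includes a HOWTO for beginners among others. You will find most (?) of the useful code examples there. Bio::Seq has some good code examples at the beginning and is where most of the general-purpose sequence functions live. Also, for input/output, the Bio::SeqIO module is used, and it has examples at the beginning of its manual.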
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

my $gpat = '[G]{3,5}';
my $npat = '[A-Z]{1,25}';
my $pattern = $gpat.$npat.$gpat.$npat.$gpat.$npat.$gpat;
my $regex = qr/$pattern/i;

my $in = Bio::SeqIO->new ( -file => "fasta_dat.txt",
                           -format => 'fasta');
my @totals;

while ( my $seq = $in->next_seq() ) {
    process($seq);
}

print "Totals: ";
print "@totals\n";

sub process {
    my $seq = shift;
    my @offset = map {sprintf '%.0f', $seq->length * $_ / 3} 1..3;
    my $sequence = $seq->seq;
    my @count = (0,0,0);
    while ($sequence =~ /$regex/g) {
        my $place = $-[0];
        for my $i (0 .. 2) {
            next if $place >= $offset[$i];
            $count[$i]++;
            last;
        }
    }
    print $seq->id, "\n@count\n";
    $totals[$_] += $count[$_] for 0 .. $#count;
}