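I extracted a sequence from a GenBank file. It consists of single-line strings of 60 bases each (with a \n at the end). How can I use Perl to reformat the sequence so that each line prints 120 bases, using regular expressions rather than BioPerl? Original format: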
1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac
361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc
421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat
481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat
I have only managed to turn them into strings 60 characters long. I am still trying to figure out how to make them 120 characters long.
my @lines = <$FH_IN>;
foreach my $line (@lines) {
    if ($line =~ m/(^\s*\d+\s)[acgt]{10}\s/) {
        $line =~ s/$1//;
        $line =~ s/ //g;
        print $line;
    }
}
Sample input:
agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcgg
agacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggtt
cgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggcc
atccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat
Each single-line string contains 60 bases.
Update (this still does not give sequence lines of 120 bases):
my @seq_60;
foreach my $line (@lines) {
    if ($line =~ m/(^\s*\d+\s)[acgt]{10}\s/) {
        $line =~ s/$1//;
        $line =~ s/ //g;
        push(@seq_60, $line);
    }
}
my @output;
for (my $pos = 0; $pos < @seq_60; $pos += 2) {
    push(@output, $seq_60[$pos] . $seq_60[$pos+1]);
}
print @output;
Answer 0 (score: 0)
How about:
s/(^|\n)([^\n]{60})\n/$1$2/g
In action:
use strict;
use warnings;
use 5.014;
my $str = q/agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcgg
agacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggtt
cgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggcc
atccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat/;
$str =~ s/(^|\n)([^\n]{60})\n/$1$2/g;
say $str;
Output:
agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggccatccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat
Explanation:
(^|\n) : group 1, start of string or line break
( : start group 2
[^\n]{60} : anything that is not a line break 60 times
) : end group 2
\n : line break
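To try the substitution on other input, it can be wrapped in a small sub. This is only a sketch: the name `merge_line_pairs` is made up for illustration, and it assumes every line holds exactly 60 characters, as in the question.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution shown above.
# Assumes the input is a single string of newline-separated 60-character lines.
sub merge_line_pairs {
    my ($str) = @_;
    # Drop the newline that ends every first line of a pair.
    $str =~ s/(^|\n)([^\n]{60})\n/$1$2/g;
    return $str;
}

# Four synthetic 60-base lines, each ending in "\n".
my $demo = ("a" x 60) . "\n" . ("c" x 60) . "\n"
         . ("g" x 60) . "\n" . ("t" x 60) . "\n";
print merge_line_pairs($demo);
```

With an odd number of lines, the final unpaired line appears to lose its trailing newline, so it is worth checking the end of the output in that case.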
Edit in response to the comments:
Join the lines pairwise:
my @out;
for (my $i = 0; $i < @in; $i += 2) {
    chomp($in[$i]);
    # the last line has no partner when the number of lines is odd
    push @out, $in[$i] . ($in[$i+1] // '');
}
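Answer 1 (score: 0)
You can read and write the lines in a single pass, storing the previous line in a variable. See the code comments for an explanation of what is happening: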
my $prev;
while (<$FH_IN>) {
    next unless /\w/;   # make sure the line has some content
    # remove the line ending
    chomp;
    # strip the base number: optional leading whitespace, the digits,
    # and the whitespace that follows them
    s/^\s*\d+\s+//;
    # remove the spaces between bases
    s/\s+//g;
    # have we got a saved line?
    if (defined $prev) {
        # print out the saved line and this line
        print $prev . $_ . "\n";
        # clear the saved line
        undef $prev;
    }
    else {
        # if we don't have a saved line, save this line
        $prev = $_;
    }
}
# an odd number of lines leaves one saved line unprinted
print "$prev\n" if defined $prev;
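The same idea can be tested without a file by reading from an in-memory filehandle. The following is a sketch only: the sub name `pair_origin_lines` is hypothetical, and the input is assumed to look like the numbered lines in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: reads numbered sequence lines from a filehandle and
# returns output lines that each hold two input lines' worth of bases.
sub pair_origin_lines {
    my ($fh) = @_;
    my (@out, $prev);
    while (my $line = <$fh>) {
        next unless $line =~ /\w/;    # skip blank lines
        chomp $line;
        $line =~ s/^\s*\d+\s+//;      # strip the base number
        $line =~ s/\s+//g;            # strip spaces between 10-base blocks
        if (defined $prev) {
            push @out, "$prev$line\n";
            undef $prev;
        }
        else {
            $prev = $line;
        }
    }
    push @out, "$prev\n" if defined $prev;    # unpaired last line, if any
    return @out;
}

# The first two lines from the question, read through an in-memory filehandle.
my $data = <<'END';
1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
END
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
print pair_origin_lines($fh);
```

Returning a list instead of printing inside the loop keeps the helper easy to check, at the cost of holding the whole sequence in memory.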