Question

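I extracted a sequence from a GenBank file. It consists of single-line strings of 60 bases each (each ending with a \n). How can I use Perl to modify the sequence so that 120 bases are printed per line, using regular expressions rather than BioPerl? Original format: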

    1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
   61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
  121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
  181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
  241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
  301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac
  361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc
  421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat
  481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat

I have only managed to turn the lines into strings 60 characters long. I am still trying to figure out how to make them 120 characters long.

my @lines= <$FH_IN>;
foreach my $line (@lines) {
    if ($line=~ m/(^\s*\d+\s)[acgt]{10}\s/) {
            $line=~ s/$1//;
            $line=~ s/ //g;
            print $line;
    }

}

Sample input:

agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcgg
agacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggtt
cgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggcc
atccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat

Each single-line string has 60 bases.

Update (this still does not give sequence lines 120 bases long):

my @seq_60;
foreach my $line (@lines) {
        if ($line=~ m/(^\s*\d+\s)[acgt]{10}\s/) {
                $line=~ s/$1//;
                $line=~ s/ //g;
                push (@seq_60, $line);
        }
}

my @output;
for (my $pos= 0; $pos< @seq_60; $pos+= 2) {
        push (@output, $seq_60[$pos] . $seq_60[$pos+1]);
}

print @output;

Answer 1

How about:

s/(^|\n)([^\n]{60})\n/$1$2/g

In action:

use strict;
use warnings;
use 5.014;

my $str = q/agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcgg
agacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggtt
cgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
gcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggcc
atccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat/;

$str =~ s/(^|\n)([^\n]{60})\n/$1$2/g;
say $str;

Output:

agatggcggcgctgaggggtcttgggggctctaggccggccacctactggtttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcctgggaggcgtgactagaagcg
gaagtagttgtgggcgcctttgcaaccgcctgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggtcgtgagggagtgcgccgggagcggagatatggagggagat
ggttcagacccagagcctccagatgccggggaggacagcaagtccgagaatggggagaatgcgcccatctactgcatctgccgcaaaccggacatcaactgcttcatgatcgggtgtgac
aactgcaatgagtggttccatggggactgcatccggatcactgagaagatggccaaggccatccgggagtggtactgtcgggagtgcagagagaaagaccccaagctagagattcgctat

Explanation:

(^|\n)      : group 1, start of string or line break
(           : start group 2
  [^\n]{60} : anything that is not a line break 60 times
)           : end group 2
\n          : line break
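
Note that the substitution has to see the whole sequence as one string, so it will not work if the lines are processed one at a time. A minimal sketch of that, using 4-character lines in place of 60 so the result is easy to read (the sub name `join_pairs` and its width parameter are made up for this illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: merge every two lines of $width characters into one.
# $width would be 60 for the sequence above; 4 keeps the demonstration short.
sub join_pairs {
    my ($text, $width) = @_;
    # Same pattern as above, with the line length passed in as a parameter
    $text =~ s/(^|\n)([^\n]{$width})\n/$1$2/g;
    return $text;
}

# For a file, slurp it into one string first, for example:
#   my $text = do { local $/; <$fh> };
my $demo   = "aaaa\ncccc\ngggg\ntttt\n";
my $joined = join_pairs($demo, 4);
print $joined;
```

With an odd number of lines, the last line has no partner and simply loses its trailing newline, so a final "\n" may need to be added back when printing.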

Edit in response to the comments:

Joining the lines in pairs:

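my @out;
# @in holds the 60-base lines, each still ending in a newline
for (my $i = 0; $i < @in; $i += 2) {
    # drop the newline so the next line is appended directly
    chomp($in[$i]);
    # with an odd number of lines the last one has no partner
    push @out, $in[$i] . ($in[$i+1] // "\n");
}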
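
A self-contained sketch of the same loop, assuming the lines still carry their trailing newlines, as they do when read straight from a file. Short made-up strings stand in for the real 60-base lines, and the names `$first` and `$second` are only for illustration:

```perl
use strict;
use warnings;

# Made-up stand-ins for the 60-base lines, newlines still attached
my @in = ("aaaa\n", "cccc\n", "gggg\n", "tttt\n", "acgt\n");

my @out;
for (my $i = 0; $i < @in; $i += 2) {
    # Copy the first line of the pair and strip the newline from the copy
    chomp(my $first = $in[$i]);
    # With an odd number of lines the last one has no partner
    my $second = $i + 1 < @in ? $in[$i + 1] : "\n";
    push @out, $first . $second;
}

print @out;
```

Working on a copy leaves `@in` unchanged, which helps if the original lines are needed again later.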

Answer 2

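You can read and write the lines in a single pass, storing the previous line in a variable. See the code comments for an explanation of what is happening: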

my $prev;
while (<$FH_IN>) {
    next unless /\w/; # make sure the lines have some content
    # remove the line endings
    chomp;
    # chop off the first 6 characters (the base numbers) - format is 4 chars that
    # can be numbers or spaces, a digit, and a space
    $_ =~ s/^[\s\d]{4}\d\s//g;
    # remove the spaces between bases
    $_ =~ s/\s//g;
    # have we got a saved line?
    if ($prev) {
        # print out saved line and this line
        print $prev . $_ . "\n";
        # delete the saved line $prev
        $prev = '';
    }
    else {
        # if we don't have a saved line, save this line
        $prev = $_;
    }
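}
# an odd number of input lines leaves one line unpaired; print it as well
print $prev . "\n" if $prev;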
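
To try this approach without a GenBank file at hand, the loop can be run against an in-memory filehandle. The sketch below is only an illustration: it uses the first three sequence lines from the question, collects the output in a made-up variable `$result` instead of printing line by line, and strips the base number with the looser pattern `^\s*\d+\s`, which is an assumption on my part rather than the pattern used above.

```perl
use strict;
use warnings;

# First three sequence lines from the question, in their original layout
my $genbank = <<'END';
    1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
   61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
  121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
END

# Open the string as if it were a file (a stand-in for $FH_IN)
open my $fh, '<', \$genbank or die "Cannot open in-memory file: $!";

my ($prev, $result) = ('', '');
while (my $line = <$fh>) {
    next unless $line =~ /\w/;    # skip empty lines
    chomp $line;
    $line =~ s/^\s*\d+\s//;       # strip the leading base number
    $line =~ s/\s//g;             # strip the spaces between 10-base groups
    if ($prev ne '') {
        $result .= "$prev$line\n";    # saved line plus this line: 120 bases
        $prev = '';
    }
    else {
        $prev = $line;                # no saved line yet, so save this one
    }
}
$result .= "$prev\n" if $prev ne '';  # unpaired last line, if any
close $fh;

print $result;
```

With three input lines this gives one line of 120 bases followed by one unpaired line of 60 bases.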

Changing the character length of single-line strings using regular expressions
