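Converting a text file to CSV using Perl

Time: 2015-06-27 10:18:08

Tags: perl csv

I have a very long text file that I want to convert into a spreadsheet. Each record consists of an Id, a Length, a Name and a Sequence. Each new protein starts with a (>) sign, followed by the Id, Length and Name on one line (in the order shown in the example), with the sequence on the next line.

Example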

1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide 
KRISTTITTTITITTGNGAG
2 > AK1H_ECOLI, 337-2799 (Clockwise), Bifunctional aspartokinase/homoserine dehydrogenase I
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP

Output

Table

Id         Length               Name                       Sequence
LPT_ECOLI  190-255 (Clockwise)  Thr operon leader peptide  KRISTTITTT

5 answers:

Answer 0 (score: 2)

Here is a slightly awkward sed script:

sed -nE '/^[0-9]+[ \t]+>/ { s/^[0-9]+[ \t]+>[ \t]+//; h; n; x; G; s/\n/,/; s/[ \t]*,[ \t]*/,/g; p }'

Output:

LPT_ECOLI,190-255 (Clockwise),Thr operon leader peptide,KRISTTITTTITITTGNGAG
AK1H_ECOLI,337-2799 (Clockwise),Bifunctional aspartokinase/homoserine dehydrogenase I,MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP

You can import this into a spreadsheet as CSV.

Edit: if you insist on Perl, this does the same thing:

perl -lpe 'chomp($_ .= "," . <>) if (s/^\d+\s*>\s*//o); s/\s*,\s*/,/g'
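
One caveat applies to both commands: a field that itself contains a comma or a double quote would be misread by a spreadsheet unless it is quoted. The sketch below shows the usual CSV convention (wrap such a field in double quotes and double any embedded quotes). The `csv_field` and `csv_row` helpers are hypothetical names introduced here for illustration, not part of the answer, and the second record is made-up data.

```perl
use strict;
use warnings;

# Hypothetical helper: quote one field if it contains a comma, quote or line break.
sub csv_field {
    my ($field) = @_;
    return $field unless $field =~ /[",\r\n]/;
    ( my $escaped = $field ) =~ s/"/""/g;    # double any embedded quotes
    return qq{"$escaped"};
}

# Hypothetical helper: join already-separated fields into one CSV line.
sub csv_row {
    return join( ',', map { csv_field($_) } @_ );
}

# A record from the question: nothing needs quoting.
print csv_row( 'LPT_ECOLI', '190-255 (Clockwise)',
    'Thr operon leader peptide', 'KRISTTITTTITITTGNGAG' ), "\n";

# A made-up record whose name contains a comma and quotes.
print csv_row( 'X_ECOLI', '1-2', 'A "made-up", illustrative name', 'MK' ), "\n";
```

For real data, an established CSV module from CPAN such as Text::CSV is probably a safer choice than hand-rolled quoting.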

Answer 1 (score: 2)

If your IDs are unique, you can do what you want like this:

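use strict;
use warnings;

my ($id, $length, $name, $sequence);
my %data;
while(<DATA>){
    s/\s+$//;    # strip the newline and any trailing spaces
    if (s/^\d+\s*>\s*//) {
        # header line: the "1 > " prefix is gone, split the rest into three fields
        ($id, $length, $name) = split /\s*,\s*/;
    }
    elsif (/^[A-Z]/) {
        # sequence line: store the fields in the same order the header row uses
        $data{$id} = [$length, $name, $_];
    }
}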


open my $out, '>', 'out.csv' or die $!;
print $out "Id,Length,Name,Sequence\n";

foreach my $id (sort keys %data){
    ($length, $name, $sequence) = @{$data{$id}};
    print $out "$id,$length,$name,$sequence\n";

}

__DATA__
1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide 
KRISTTITTTITITTGNGAG
2 > AK1H_ECOLI, 337-2799 (Clockwise), Bifunctional aspartokinase/homoserine dehydrogenase I
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP

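This works by splitting the data on "," and building a hash of arrays, with the ids as keys and the remaining fields as values. The hash can then be printed to a .csv file.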
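
The data structure described above can be sketched on its own as follows. This is a minimal illustration rather than the answerer's code: the `parse_records` name and the extra order list (which keeps the input order instead of sorting the ids) are assumptions added here.

```perl
use strict;
use warnings;

# Hypothetical helper: build a hash of arrays from header/sequence lines.
# Returns a reference to the hash and a reference to the list of ids
# in the order they were first seen.
sub parse_records {
    my @lines = @_;
    my ( %data, @order, $id );
    for my $line (@lines) {
        ( my $text = $line ) =~ s/\s+$//;    # drop newline and trailing spaces
        if ( $text =~ s/^\d+\s*>\s*// ) {
            # header line: id, length, name
            my ( $length, $name );
            ( $id, $length, $name ) = split /\s*,\s*/, $text;
            push @order, $id unless exists $data{$id};
            $data{$id} = [ $length, $name ];
        }
        elsif ( defined $id && $text =~ /^[A-Z]+$/ ) {
            # sequence line: append it to the record of the current id
            push @{ $data{$id} }, $text;
        }
    }
    return ( \%data, \@order );
}

my ( $data, $order ) = parse_records(
    "1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide \n",
    "KRISTTITTTITITTGNGAG\n",
);

# Print one CSV line per id, in input order.
print join( ',', $_, @{ $data->{$_} } ), "\n" for @$order;
```

Keeping a separate order list is a design choice: `sort keys %data` in the answer emits the records alphabetically by id, which may or may not be what you want in the spreadsheet.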

Answer 2 (score: 2)

Here is another option:

use strict;
use warnings;

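# Read two lines per iteration; stop cleanly (without warnings) at end of input.
while ( defined( my $header = <DATA> ) and defined( my $sequence = <DATA> ) ) {
    print join (',', ( split />\s+|,\s+|\n/, $header . $sequence )[ 1 .. 4 ]), "\n";
}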

__DATA__
1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide
KRISTTITTTITITTGNGAG
2 > AK1H_ECOLI, 337-2799 (Clockwise), Bifunctional aspartokinase/homoserine dehydrogenase I
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP

Output:

LPT_ECOLI,190-255 (Clockwise),Thr operon leader peptide,KRISTTITTTITITTGNGAG
AK1H_ECOLI,337-2799 (Clockwise),Bifunctional aspartokinase/homoserine dehydrogenase I,MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP

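The while loop reads two lines at a time. split uses a regular expression to break those lines on ">" or "," (each followed by whitespace) or on "\n". join then takes elements 1-4 of the split result and joins them with commas, and print outputs the result.

Hope this helps!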
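
To make the splitting step concrete, the following sketch applies the same regular expression to one hard-coded pair of lines and prints every piece with its index. Element 0 is the record number before the ">", which is why the slice starts at 1. The sample string is taken from the question; the `@parts` variable name is just an illustrative choice.

```perl
use strict;
use warnings;

# One header line and its sequence line, concatenated as in the loop above.
my $lines = "1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide\n"
          . "KRISTTITTTITITTGNGAG\n";

# Split on ">" plus whitespace, on "," plus whitespace, or on a newline.
my @parts = split />\s+|,\s+|\n/, $lines;

# Show each piece with its index; index 0 holds the record number.
printf "%d: [%s]\n", $_, $parts[$_] for 0 .. $#parts;

# Elements 1..4 are Id, Length, Name and Sequence.
print join( ',', @parts[ 1 .. 4 ] ), "\n";
```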

Answer 3 (score: 0)

In Perl:

#!/usr/bin/perl
use strict; use warnings;
open(my $fh, "<", "foo.data") || die "Cannot open foo.data: $!";
my $last_was_rec_start = 0;
my ($id, $len, $name);
for (my $lineno=1; defined(my $line = <$fh>); $lineno++ ) {
    chomp($line);
    if ($last_was_rec_start) {
        # Add validation that line matches protein sequence?
        print "${id},${len},${name},$line\n";
        $last_was_rec_start = 0;
        next;
    }
    # Header line: split into "number > id", length and name
    my @fields = split(/,\s+/, $line);
    unless (scalar(@fields) == 3) {
        print STDERR "Malformed line ${lineno}; expecting 3 comma-delimited fields:\n${line}\n";
        next;
    }
    $len = $fields[1];
    $name = $fields[2];
    unless ($fields[0] =~ /\d+ > (.*)/) {
        print STDERR "Malformed line ${lineno}; expecting number >\n${line}\n";
        next;
    }
    $last_was_rec_start = 1;
    $id = $1;
}

For your example this gives the following output:

LPT_ECOLI,190-255 (Clockwise),Thr operon leader peptide,KRISTTITTTITITTGNGAG
AK1H_ECOLI,337-2799 (Clockwise),Bifunctional aspartokinase/homoserine dehydrogenase I,MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP

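Basically, the code splits each header line on a comma followed by whitespace (", "). The first field is then matched to find and strip the leading "number >". Once a header line has matched, the line that follows it is treated as the sequence line.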
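
The comment in the code above suggests validating that a line looks like a protein sequence. One possible check is sketched here. The `is_protein_sequence` name and the exact letter set (the 20 standard one-letter amino-acid codes plus B, Z, X, U and O) are assumptions, so adjust them to whatever your data source allows.

```perl
use strict;
use warnings;

# Hypothetical check: true if the (already chomped) line consists only of
# upper-case one-letter amino-acid codes. The alphabet is an assumption.
sub is_protein_sequence {
    my ($line) = @_;
    return $line =~ /\A[ACDEFGHIKLMNPQRSTVWYBZXUO]+\z/ ? 1 : 0;
}

# Try it on a sequence line and on a header line from the question.
for my $line ( 'KRISTTITTTITITTGNGAG', '2 > AK1H_ECOLI, 337-2799 (Clockwise)' ) {
    printf "%-40s %s\n", $line,
        is_protein_sequence($line) ? 'sequence' : 'not a sequence';
}
```

In the loop above, such a check could be called just before the print, with a message to STDERR when it fails.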

However, you may also want to look at Bio::Perl. It can probably write CSV files, and if your input is in some standard format, it can read that too.

Answer 4 (score: 0)

Please find sample code below. To run it on a real file, replace <DATA> with <STDIN> and execute it as script < input-file > output-file

use strict;
use warnings;

# print CSV header line
print "N, Id, Length, Name, Sequence\n";

my($line1,$line2);
while( defined($line1=<DATA>) and defined($line2=<DATA>)) {
    # put the two input lines read above into $_
    local $_ = $line1 . $line2;
    my ($N, $Id, $Length, $Name, $Sequence ) = m{
        ^(\d{1,6})                  # $N - record number (?)
        \x20>\x20
        ([A-Z0-9_]{1,128}?)         # $Id
        \x20*,\x20*
        ([- ()0-9A-Za-z]{1,128}?)   # $Length
        \x20*,\x20*
        ([^,\"\'\n\r]{1,256}?)      # $Name
        # the quotes (\"\') are escaped/backslashed to make SO syntax coloring work
        \x20*\r?\n
        ([A-Z]{1,4096}?)            # $Sequence
        \r?\n
    }sox or die "wrong line format (line $.):\n $_";
    printf "%d, %s, %s, %s, %s\n", $N, $Id, $Length, $Name, $Sequence;
}
die if defined($line1); # incomplete set of input lines

__DATA__
1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide
KRISTTITTTITITTGNGAG
2 > AK1H_ECOLI, 337-2799 (Clockwise), Bifunctional aspartokinase/homoserine dehydrogenase I
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP