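I have a very long text file that I want to convert into a spreadsheet. Each record consists of an Id, Length, Name, and Sequence. Every new protein starts with a header line containing a (>) sign followed by the Id, length, and name, and the sequence is on the next line.
Example: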
1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide
KRISTTITTTITITTGNGAG
2 > AK1H_ECOLI, 337-2799 (Clockwise), Bifunctional aspartokinase/homoserine dehydrogenase I
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
Desired output (as a table):
Id Length Name Sequence
LPT_ECOLI 190-255 (Clockwise) Thr operon leader peptide KRISTTITTT
Answer 0 (score: 2)
Here is a somewhat awkward sed script:
sed -nE '/^[0-9]+[ \t]+>/ { s/^[0-9]+[ \t]+>[ \t]+//; h; n; x; G; s/\n/,/; s/[ \t]*,[ \t]*/,/g; p }'
Output:
LPT_ECOLI,190-255 (Clockwise),Thr operon leader peptide,KRISTTITTTITITTGNGAG
AK1H_ECOLI,337-2799 (Clockwise),Bifunctional aspartokinase/homoserine dehydrogenase I,MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
You can import this into your spreadsheet as CSV.
Edit: and the same thing in Perl, if you insist:
perl -lpe 'chomp($_ .= "," . <>) if (s/^\d+\s*>\s*//o); s/\s*,\s*/,/g'
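One caveat when importing this output as CSV: a protein name that itself contains a comma would spill into extra columns. As a minimal sketch (in Python rather than sed or Perl, and assuming every header has the shape "N > Id, Length, Name" with the sequence on the following line), the standard csv module can quote such fields. The function name records_to_csv is hypothetical.

```python
import csv
import io
import re

# Matches the "N > " prefix of a header line and captures the rest.
HEADER = re.compile(r"^\d+\s*>\s*(.*)$")

def records_to_csv(lines):
    """Turn alternating header/sequence lines into CSV text with quoting."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Id", "Length", "Name", "Sequence"])
    it = iter(lines)
    for line in it:
        match = HEADER.match(line.strip())
        if not match:
            continue  # not a header line, skip it
        # Split on the first two commas only, so commas inside the name survive.
        fields = [f.strip() for f in match.group(1).split(",", 2)]
        if len(fields) != 3:
            continue  # malformed header
        sequence = next(it, "").strip()
        writer.writerow(fields + [sequence])
    return out.getvalue()
```

Passing an open file object as lines works as well, since the function only iterates over it.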
Answer 1 (score: 2)
If your Ids are unique, you can do what you want like this:
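use strict;
use warnings;

my ($id, $length, $name);
my %data;
while (<DATA>) {
    chomp;
    if (/^\d+\s*>\s*/) {
        # Header line: drop the "N > " prefix, then split the three fields on commas.
        s/^\d+\s*>\s*//;
        ($id, $length, $name) = split /\s*,\s*/, $_, 3;
    }
    elsif (defined $id && /^[A-Z]/) {
        # Sequence line: store it under the most recent Id.
        $data{$id} = [$length, $name, $_];
    }
}

open my $out, '>', 'out.csv' or die "Cannot open out.csv: $!";
print $out "Id,Length,Name,Sequence\n";
foreach my $key (sort keys %data) {
    my ($len, $desc, $sequence) = @{ $data{$key} };
    print $out "$key,$len,$desc,$sequence\n";
}
close $out or die "Cannot close out.csv: $!";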
__DATA__
1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide
KRISTTITTTITITTGNGAG
2 > AK1H_ECOLI, 337-2799 (Clockwise), Bifunctional aspartokinase/homoserine dehydrogenase I
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
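This works by splitting the data on commas and building a hash of arrays, with the Ids as keys and the remaining information as values. The hash can then be printed to a .csv file.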
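The same keyed structure can be sketched in other languages. Below is a minimal Python version (not the answer's Perl), assuming unique Ids and well-formed headers with three comma-separated fields; a header with fewer fields would raise an error here. The name build_table is hypothetical.

```python
import re

def build_table(lines):
    """Map each Id to a [length, name, sequence] list."""
    table = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if re.match(r"^\d+\s*>", line):
            # Header line: strip the "N > " prefix, then split on commas.
            header = re.sub(r"^\d+\s*>\s*", "", line)
            current_id, length, name = [f.strip() for f in header.split(",", 2)]
            table[current_id] = [length, name, ""]
        elif current_id is not None and re.match(r"^[A-Z]", line):
            # Sequence line: attach it to the most recent Id.
            table[current_id][2] = line
    return table
```

Because the Ids are dictionary keys, a repeated Id would silently overwrite the earlier record, which is why uniqueness matters for this approach.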
Answer 2 (score: 2)
Here is another option:
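use strict;
use warnings;

# Read two lines at a time: a header line and the sequence line that follows it.
while ( defined( my $header = <DATA> ) and defined( my $sequence = <DATA> ) ) {
    my $lines = $header . $sequence;
    print join( ',', ( split />\s+|,\s+|\n/, $lines )[ 1 .. 4 ] ), "\n";
}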
__DATA__
1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide
KRISTTITTTITITTGNGAG
2 > AK1H_ECOLI, 337-2799 (Clockwise), Bifunctional aspartokinase/homoserine dehydrogenase I
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
Output:
LPT_ECOLI,190-255 (Clockwise),Thr operon leader peptide,KRISTTITTTITITTGNGAG
AK1H_ECOLI,337-2799 (Clockwise),Bifunctional aspartokinase/homoserine dehydrogenase I,MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
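The while loop reads two lines at a time. split uses a regular expression to break those lines on ">", ",", or "\n"; elements 1-4 of the resulting list are then joined with commas by join, and the result is printed.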
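The split step can be checked in isolation. Below is a Python sketch of the same pattern applied to one header/sequence pair; split_record is a hypothetical helper, and re.split stands in for Perl's split.

```python
import re

def split_record(header, sequence):
    """Split a header line plus its sequence line into four CSV fields."""
    pair = header + sequence
    # Same alternation as the Perl pattern: ">" plus whitespace,
    # "," plus whitespace, or a newline.
    parts = re.split(r">\s+|,\s+|\n", pair)
    # parts[0] is the record number; items 1-4 are Id, length, name, sequence.
    return parts[1:5]
```

Unlike Perl's split, re.split keeps the trailing empty string produced by the final newline, so the slice is what limits the result to the four wanted fields.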
Hope this helps!
Answer 3 (score: 0)
In Perl:
#!/usr/bin/perl
use strict;
use warnings;

open(my $fh, "<", "foo.data") or die "Cannot open foo.data: $!";

my $last_was_rec_start = 0;
my ($id, $len, $name);

for (my $lineno = 1; defined(my $line = <$fh>); $lineno++) {
    chomp($line);

    if ($last_was_rec_start) {
        # Add validation that line matches protein sequence?
        print "${id},${len},${name},$line\n";
        $last_was_rec_start = 0;
        next;
    }

    # Header line: expect "number > Id, length, name".
    my @fields = split(/,\s+/, $line);
    unless (scalar(@fields) == 3) {
        print STDERR "Malformed line ${lineno}; expecting 3 comma-delimited fields:\n${line}\n";
        next;
    }
    $len  = $fields[1];
    $name = $fields[2];

    unless ($fields[0] =~ /\d+ > (.*)/) {
        print STDERR "Malformed line ${lineno}; expecting number >\n${line}\n";
        next;
    }
    $last_was_rec_start = 1;
    $id = $1;
}
This gives the following output for your example:
LPT_ECOLI,190-255 (Clockwise),Thr operon leader peptide,KRISTTITTTITITTGNGAG
AK1H_ECOLI,337-2799 (Clockwise),Bifunctional aspartokinase/homoserine dehydrogenase I,MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP
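Basically, the code starts by splitting each header line on a comma followed by whitespace (", "). The first field is then matched to strip the leading number and ">". Once a header line has matched, the following line is taken as its sequence line.
However, you might also want to look at Bio::Perl. It may be able to write CSV files, and if your input is in some standard format, it may be able to read that too.
Answer 4 (score: 0)
Please find sample code below. To use it on a real file, replace <DATA> with <STDIN> and run it as script < input-file > output-file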
use strict;
use warnings;

# print CSV header line
print "N, Id, Length, Name, Sequence\n";

my ($line1, $line2);
while ( defined($line1 = <DATA>) and defined($line2 = <DATA>) ) {
    # put the two input lines read above into $_
    local $_ = $line1 . $line2;
    my ($N, $Id, $Length, $Name, $Sequence) = m{
        ^(\d{1,6})                   # N: record number (?)
        \x20>\x20
        ([A-Z0-9_]{1,128}?)          # Id
        \x20*,\x20*
        ([- ()0-9A-Za-z]{1,128}?)    # Length
        \x20*,\x20*
        ([^,\"\'\n\r]{1,256}?)       # Name
        # the quotes (\"\') are escaped/backslashed to make SO syntax coloring work
        \x20*\r?\n
        ([A-Z]{1,4096}?)             # Sequence
        \r?\n
    }sox or die "wrong line format (line $.):\n $_";
    printf "%d, %s, %s, %s, %s\n", $N, $Id, $Length, $Name, $Sequence;
}
# a header line without a following sequence line means the input is incomplete
die "incomplete set of input lines\n" if defined($line1);
__DATA__
1 > LPT_ECOLI, 190-255 (Clockwise), Thr operon leader peptide
KRISTTITTTITITTGNGAG
2 > AK1H_ECOLI, 337-2799 (Clockwise), Bifunctional aspartokinase/homoserine dehydrogenase I
MRVLKFGGTSVANAERFLRVADILESNARQGQVATVLSAP