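I have a multi-GB file that contains thousands of individual files, each identified by an ID.
Each component file consists of four comment lines followed by the content. The second comment line of each block holds a unique ID. I want to split the big file into separate files named after their IDs.
There is also a second file, a size list, which holds each ID and its size. I want the matching size-list line written as the first line of each output file.
Example (the size list first, then the input file):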
A_1 100
Bxx_xx 25
P_b 342
1A_Z0 343
Z867 200
BWS 111
# ver XX
# Query: A_1
# Database: XX
# Usage: XX
A_1 .*
A_1 .*
A_1 .*
A_1 .*
A_1 .*
# ver
# Query: Bxx_xx
# Database: XXXXXX
# Usage: XXXXX
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
# ver
# Query: P_b
# Database: XXXXXX
# Usage: XXXXX
P_b.*
P_b.*
P_b.*
P_b.*
P_b.*
P_b.*
# ver
# Query: 1A_Z0
# Database: XXXXXX
# Usage: XXXXX
1A_Z0.*
1A_Z0.*
1A_Z0.*
1A_Z0.*
# ver
# Query: Z867
# Database: XXXXXX
# Usage: XXXXX
# ver
# Query: BWS
# Database: XXXXXX
# Usage: XXXXX
BWS.*
BWS.*
BWS.*
The output should look like this (one file per ID, named ID.txt):
A_1 100
# ver XX
# Query: A_1
# Database: XX
# Usage: XX
A_1 .*
A_1 .*
A_1 .*
A_1 .*
A_1 .*
Bxx_xx 25
# ver
# Query: Bxx_xx
# Database: XXXXXX
# Usage: XXXXX
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
Bxx_xx .*
P_b 342
# ver
# Query: P_b
# Database: XXXXXX
# Usage: XXXXX
P_b.*
P_b.*
P_b.*
P_b.*
P_b.*
P_b.*
1A_Z0 343
# ver
# Query: 1A_Z0
# Database: XXXXXX
# Usage: XXXXX
1A_Z0.*
1A_Z0.*
1A_Z0.*
1A_Z0.*
Z867 200
# ver
# Query: Z867
# Database: XXXXXX
# Usage: XXXXX
BWS 111
# ver
# Query: BWS
# Database: XXXXXX
# Usage: XXXXX
BWS.*
BWS.*
BWS.*
In some cases there may be no content after the four comment lines. For example,
# ver
# Query: Z867
# Database: XXXXXX
# Usage: XXXXX
I would still like such a block written out as a new file, Z867.txt.
My code is as follows:
while ( $line = <BOF> ) {
chomp $line;
$cpline = $line;
next if ( $cpline =~ /^Query/ );
if ( $cpline =~ /^#\sQuery\:\s(\w.*)/ ) {
$query = $1;
foreach $sizeLine (@sizeList) {
$sizeLine =~ /^(\w.*)\t(\d+)$/;
$seqId = $1;
$seqLen = $2;
if ( $seqId eq $query ) {
print "Query\t$seqLen\n";
}
}
}
$cpline = "";
if ( $line =~ /^#/ ) {
print "$line\n";
}
if ( $line !~ /^#/ ) {
if ( $line =~ /^((.+)\_.+)\t((.+)\_.+)\t(.+)\t(.+)\t.+\t.+\t.+\t.+\t.+\t.+\t.+\t\s?.+$/ ) {
$queryId = $1;
if ( $seqId eq $queryId ) {
print "$line\n";
}
}
}
}
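Answer 0 (score: 3)
I'm confused about your requirements, because your Perl code seems to do something very different from what your question describes. However, here is a simple solution that opens a new file for each # Query: line in the comments and produces the output you want.
This program expects the path to the input file as a parameter on the command line.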
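use strict;
use warnings 'all';
use autodie;

my $out_fh;    # handle for the file currently being written
my @header;    # comment lines buffered until all four have been read

while ( <> ) {
    if ( /^#/ ) {
        push @header, $_;
        # The "# Query:" line names the output file
        if ( /Query:\s*(\S+)/ ) {
            my $file = "$1.txt";
            print qq{Creating "$file"\n};
            open $out_fh, '>', $file;
        }
        # Write the header once it is complete
        if ( @header == 4 ) {
            print $out_fh @header;
            @header = ();
        }
    }
    elsif ( $out_fh ) {
        print $out_fh $_;
    }
}
# Guard the close in case the input contained no "# Query:" line
close $out_fh if $out_fh;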
Output:
Creating "A_1.txt"
Creating "Bxx_xx.txt"
Creating "P_b.txt"
Creating "1A_Z0.txt"
Creating "Z867.txt"
Creating "BWS.txt"
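The ID extraction is the step everything else hangs on, so here is a minimal sketch of it in isolation. The helper name `query_id` is hypothetical and not part of the program above; it only wraps the same `Query:\s*(\S+)` match, anchored to the start of the line. Because `\S+` stops at the first whitespace, trailing blanks after the ID do not end up in the file name.

```perl
use strict;
use warnings;

# Hypothetical helper: return the ID from a "# Query:" line,
# or undef when the line is not a Query line.
sub query_id {
    my ($line) = @_;
    return $line =~ /^#\s*Query:\s*(\S+)/ ? $1 : undef;
}

# Sample lines; the second one has trailing spaces after the ID
for my $line ( "# Query: A_1\n", "# Query: 1A_Z0  \n", "# Database: XX\n" ) {
    my $id = query_id($line);
    print defined $id ? "ID: $id\n" : "not a Query line\n";
}
```

A pattern such as `(\w.*)`, as used in the question, would capture everything up to the end of the line instead, including any trailing whitespace.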
Here is a new version of my code that meets your revised specification. (Please don't revise a question like that.)
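use strict;
use warnings 'all';
use autodie;

# Fall back to these example file names when no arguments are given
@ARGV = qw/ 4l.txt size_list.txt / unless @ARGV;
my ( $input, $size_list ) = @ARGV;

# Load the size list into a hash keyed by ID
my %sizes;
{
    open my $fh, '<', $size_list;
    while ( <$fh> ) {
        my ($id, $size) = split;
        $sizes{$id} = $size if defined $size;
    }
}

my $out_fh;    # handle for the file currently being written
my @header;    # comment lines buffered until all four have been read

# Read only the data file; a bare <> would go on to read the size list too
open my $in_fh, '<', $input;
while ( <$in_fh> ) {
    if ( /^#/ ) {
        push @header, $_;
        if ( /Query:\s*(\S+)/ ) {
            my $id   = $1;
            my $size = $sizes{$id};
            die qq{No size found for ID "$id"} unless defined $size;
            my $file = "$id.txt";
            print qq{Creating "$file"\n};
            open $out_fh, '>', $file;
            # The size line goes first, before the buffered header
            print $out_fh "$id\t$size\n";
        }
        if ( @header == 4 ) {
            print $out_fh @header;
            @header = ();
        }
    }
    elsif ( $out_fh ) {
        print $out_fh $_;
    }
}
close $out_fh if $out_fh;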
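To check the splitting logic without creating any files, the same loop can be run over a string and collect each record in a hash. This is only a sketch under assumptions: `split_records` is a hypothetical name, the sample rows are made up, and it assumes every four-line header contains a `# Query:` line. It also shows that a block with no content rows, such as Z867, still produces an entry.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text into records keyed by ID.
# Returns a hash ref mapping each ID to the text its file would contain.
sub split_records {
    my ( $text, $sizes ) = @_;
    my ( %out, $id, @header );

    for my $line ( split /^/, $text ) {    # split into lines, keeping newlines
        if ( $line =~ /^#/ ) {
            push @header, $line;
            if ( $line =~ /Query:\s*(\S+)/ ) {
                $id = $1;
                die qq{No size found for ID "$id"\n}
                    unless defined $sizes->{$id};
                # The size line goes first, before the buffered header
                $out{$id} = "$id\t$sizes->{$id}\n";
            }
            if ( @header == 4 ) {
                $out{$id} .= join '', @header if defined $id;
                @header = ();
            }
        }
        elsif ( defined $id ) {
            $out{$id} .= $line;
        }
    }
    return \%out;
}

# Made-up sample input: one empty block and one block with a single row
my $text = <<'END';
# ver
# Query: Z867
# Database: XX
# Usage: XX
# ver
# Query: BWS
# Database: XX
# Usage: XX
BWS row 1
END

my $files = split_records( $text, { Z867 => 200, BWS => 111 } );
print "--- $_.txt ---\n$files->{$_}" for sort keys %$files;
```

Keeping the sizes in a hash means each ID costs one lookup, rather than a scan of the whole size list for every `# Query:` line as in the question's code.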
Answer 1 (score: 0)
Code:
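use strict;
use warnings 'all';

my $filename = "t1";    # provide your input file name
open my $in, '<', $filename or die "Cannot open $filename: $!\n";

my $out;          # handle for the current output file
my $prev_line;    # the "# ver" line, held until the ID is known

while ( my $line = <$in> ) {
    chomp $line;
    if ( $line =~ /^#\sver/ ) {
        # A "# ver" line starts a new record, so finish the previous file
        close $out if $out;
        undef $out;
        $prev_line = $line;
    }
    elsif ( $line =~ /^#\sQuery:\s*(\S+)/ ) {
        my $id = $1;
        print "$id\n";
        open $out, '>', "$id.txt" or die "Cannot create $id.txt: $!\n";
        print $out "$prev_line\n$line\n";
    }
    elsif ($out) {
        print $out "$line\n";
    }
}
close $out if $out;
close $in;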
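Hope this helps. Note that this code only splits the file; it does not write the size line at the top of each output file.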