How can I split a large file on a repeating pattern in Perl and write each record to its own file?

Asked: 2016-07-11 11:37:34

Tags: perl split pattern-matching

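I have a multi-GB file that contains thousands of individual files, each identified by an ID.

Each component file consists of four comment lines followed by the content. The second comment line of each record (# Query: ...) holds a unique ID. I want to split the big file into separate files named after their IDs.

There is also a second file, a size list, that holds IDs and sizes. I want the matching line from that list written as the first line of each output file.

Example

Size list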

A_1 100
Bxx_xx  25
P_b 342
1A_Z0   343
Z867    200
BWS 111

Input file

# ver XX
# Query: A_1
# Database: XX
# Usage: XX
A_1 .*
A_1 .*
A_1 .*
A_1 .*
A_1 .*
# ver
# Query: Bxx_xx
# Database: XXXXXX
# Usage: XXXXX
Bxx_xx  .*
Bxx_xx  .*
Bxx_xx  .*
Bxx_xx  .*
Bxx_xx  .*
Bxx_xx  .*
Bxx_xx  .*
Bxx_xx  .*
Bxx_xx  .*
Bxx_xx  .*
# ver
# Query: P_b
# Database: XXXXXX
# Usage: XXXXX
P_b.*
P_b.*
P_b.*
P_b.*
P_b.*
P_b.*
# ver
# Query: 1A_Z0
# Database: XXXXXX
# Usage: XXXXX
1A_Z0.*
1A_Z0.*
1A_Z0.*
1A_Z0.*
# ver
# Query: Z867
# Database: XXXXXX
# Usage: XXXXX
# ver
# Query: BWS
# Database: XXXXXX
# Usage: XXXXX
BWS.*
BWS.*
BWS.*

The output should look like this, one file per ID (ID.txt)

A_1.txt

A_1 100
# ver XX
# Query: A_1
# Database: XX
# Usage: XX
A_1 .*
A_1 .*
A_1 .*
A_1 .*
A_1 .*

Bxx_xx.txt

Bxx_xx  25
# ver
# Query: Bxx_xx
# Database: XXXXXX
# Usage: XXXXX
Bxx_xx  .*
Bxx_xx  .*
Bxx_xx  .*
Bxx_xx  .*
Bxx_xx  .*
Bxx_xx  .*
Bxx_xx  .*
Bxx_xx  .*
Bxx_xx  .*
Bxx_xx  .*

P_b.txt

P_b 342
# ver
# Query: P_b
# Database: XXXXXX
# Usage: XXXXX
P_b.*
P_b.*
P_b.*
P_b.*
P_b.*
P_b.*

1A_Z0.txt

1A_Z0   343
# ver
# Query: 1A_Z0
# Database: XXXXXX
# Usage: XXXXX
1A_Z0.*
1A_Z0.*
1A_Z0.*
1A_Z0.*

Z867.txt

Z867    200
# ver
# Query: Z867
# Database: XXXXXX
# Usage: XXXXX

BWS.txt

BWS 111
# ver
# Query: BWS
# Database: XXXXXX
# Usage: XXXXX
BWS.*
BWS.*
BWS.*

In some cases there may be no content after the four comment lines. For example,

# ver
# Query: Z867
# Database: XXXXXX
# Usage: XXXXX

I still want such a record written out as a new file, Z867.txt

My code is as follows

while ( $line = <BOF> ) {

    chomp $line;
    $cpline = $line;

    next if ( $cpline =~ /^Query/ );

    if ( $cpline =~ /^#\sQuery\:\s(\w.*)/ ) {

        $query = $1;

        foreach $sizeLine (@sizeList) {

            $sizeLine =~ /^(\w.*)\t(\d+)$/;
            $seqId  = $1;
            $seqLen = $2;

            if ( $seqId eq $query ) {
                print "Query\t$seqLen\n";
            }
        }
    }

    $cpline = "";

    if ( $line =~ /^#/ ) {
        print "$line\n";
    }

    if ( $line !~ /^#/ ) {

        if ( $line =~ /^((.+)\_.+)\t((.+)\_.+)\t(.+)\t(.+)\t.+\t.+\t.+\t.+\t.+\t.+\t.+\t\s?.+$/ ) {

            $queryId = $1;

            if ( $seqId eq $queryId ) {
                print "$line\n";
            }
        }
    }
}

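2 answers:

Answer 0 (score: 3)

I'm confused by your requirements, because your Perl code seems to do something very different from what your question describes. However, here is a simple solution that opens a new file for each # Query: line in the comments and produces the output you want

This program expects the path to the input file as a parameter on the command line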

use strict;
use warnings 'all';
use autodie;

my $out_fh;
my @header;

while ( <> ) {

    if ( /^#/ ) {

        push @header, $_;

        if ( /Query:\s*(\S+)/ ) {
            my $file = "$1.txt";
            print qq{Creating "$file"\n};
            open $out_fh, '>', $file;
        }

        if ( @header == 4 ) {
            print $out_fh @header;
            @header = ();
        }
    }
    elsif ( $out_fh ) {
        print $out_fh $_;
    }
}

close $out_fh if $out_fh;

Output

Creating "A_1.txt"
Creating "Bxx_xx.txt"
Creating "P_b.txt"
Creating "1A_Z0.txt"
Creating "Z867.txt"
Creating "BWS.txt"


Update

Here is a new version of my code that conforms to your revised specification. (Please don't do that.)

use strict;
use warnings 'all';
use autodie;

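# Hard-coded file names: the input file and the size list
@ARGV = qw/ 4l.txt size_list.txt /;

my ( $input, $size_list ) = @ARGV;

# Leave only the input file in @ARGV so that <> does not also read the size list
@ARGV = ($input);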

my %sizes;
{
    open my $fh, '<', $size_list;
    while ( <$fh> ) {
        my ($file, $size) = split;
        $sizes{$file} = $size if defined $size;
    }
}


my $out_fh;
my @header;

while ( <> ) {

    if ( /^#/ ) {

        push @header, $_;

        if ( /Query:\s*(\S+)/ ) {

            my $id = $1;
            my $size = $sizes{$id};
            die qq{No size found for ID "$id"} unless defined $size;
            my $file = "$id.txt";

            print qq{Creating "$file"\n};

            open $out_fh, '>', $file;
            print $out_fh "$id\t$size\n";
        }

        if ( @header == 4 ) {
            print $out_fh @header;
            @header = ();
        }
    }
    elsif ( $out_fh ) {
        print $out_fh $_;
    }
}

close $out_fh if $out_fh;
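
The size lookup used above can be checked on its own. This is a small sketch under the assumption that each size-list line is an ID followed by whitespace and a number; the helper name parse_sizes and the sample lines are hypothetical. Splitting on whitespace accepts both tabs and runs of spaces, whereas a pattern that requires a literal tab would skip lines such as "Bxx_xx  25".

```perl
use strict;
use warnings;

# Hypothetical helper: build an ID => size lookup from size-list lines.
sub parse_sizes {
    my @lines = @_;
    my %sizes;

    for my $line (@lines) {
        # split ' ' ignores leading whitespace and splits on any run of
        # spaces or tabs, so mixed separators are handled the same way
        my ( $id, $size ) = split ' ', $line;
        $sizes{$id} = $size if defined $size;    # skips blank or short lines
    }

    return \%sizes;
}

# Sample lines (assumption): single space, two spaces, a tab, a blank line
my $sizes = parse_sizes( "A_1 100\n", "Bxx_xx  25\n", "1A_Z0\t343\n", "\n" );

print "$_\t$sizes->{$_}\n" for sort keys %$sizes;
```

Building the hash once gives a constant-time lookup per ID, instead of scanning the whole size list again for every # Query: line.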

Answer 1 (score: 0)

Code:

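use strict;
use warnings 'all';

my $filename = "t1";    # provide your input file name
open my $in_fh, '<', $filename or die "Cannot open $filename: $!\n";

my $out_fh;
my $prev_line;

while ( my $line = <$in_fh> ) {
    chomp $line;

    if ( $line =~ /^#\sver/ ) {
        # A "# ver" line starts a new record: close the previous output file
        close $out_fh if $out_fh;
        $prev_line = $line;
    }
    elsif ( $line =~ /^#\sQuery:\s*(\S+)/ ) {
        # The ID on the "# Query:" line names the output file
        my $id = $1;
        print "$id\n";
        open $out_fh, '>', "$id.txt" or die "Cannot open $id.txt: $!\n";
        print $out_fh "$prev_line\n$line\n";
    }
    elsif ( $out_fh ) {
        # Remaining comment lines and content go to the current output file
        print $out_fh "$line\n";
    }
}

close $out_fh if $out_fh;
close $in_fh;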

Hope this helps.