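Compare FILE1 values against FILE2 ranges and print matches

Time: 2014-10-17 10:16:51

Tags: perl

I'm new to Perl and am working on a bioinformatics project at university. My FILE1 contains a list of positions, in the format: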

99269
550
100
126477 
1700

and FILE2 is in the format:

517 1878 forward
700 2500 forward
2156 3289 forward
99000 100000 forward
22000 23000 backward 

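I want to compare each position in FILE1 against every range of values in FILE2. If a position falls within one of the ranges, I want to print the position, the range, and the direction.

So my expected output is: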

99269 99000 100000 forward
550 517 1878 forward
1700 517 1878 forward 

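At the moment it runs without errors, but it doesn't output anything, so I'm not sure where I'm going wrong! When I split up the final 'if' rule it runs, but it only works when the position is exactly the same as the range.

My code is as follows: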

#!/usr/bin/perl

use strict;
use warnings;

my $outputfile = "/Users/edwardtickle/Documents/CC22CDS.txt";

open FILE1, "/Users/edwardtickle/Documents/CC22positions.txt"
    or die "cannot open > CC22: $!";

open FILE2, "/Users/edwardtickle/Documents/CDSpositions.txt"
    or die "cannot open > CDS: $!";

open( OUTPUTFILE, ">$outputfile" ) or die "Could not open output file: $! \n";

while (<FILE1>) {
    if (/^(\d+)/) {
        my $CC22 = $1;

        while (<FILE2>) {
            if (/^(\d+)\s+(\d+)\s+(\S+)/) {
                my $CDS1 = $1;
                my $CDS2 = $2;
                my $CDS3 = $3;

                if ( $CC22 > $CDS1 && $CC22 < $CDS2 ) {
                    print OUTPUTFILE "$CC22 $CDS1 $CDS2 $CDS3\n";
                }
            }
        }
    }
}

close(FILE1);
close(FILE2);

I have also posted the same question on PerlMonks.

3 Answers:

Answer 0: (score: 2)

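FILE2 is read all the way to the end while the first line of FILE1 is being compared. For every later line of FILE1, the inner loop finds FILE2 already at end-of-file, so nothing is compared.

Store the lines of FILE1 in an array, then compare each line of FILE2 against every array entry, like this: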

#!/usr/bin/perl

use strict;
use warnings;

my $outputfile = "out.txt";

open FILE1, "file1.txt"
    or die "cannot open > CC22: $!";

open FILE2, "file2.txt"
    or die "cannot open > CDS: $!";

open( OUTPUTFILE, ">$outputfile" ) or die "Could not open output file: $! \n";
my @file1list = ();

while (<FILE1>) {
    if (/^(\d+)/) {
        push @file1list, $1;
    }
}

while (<FILE2>) {
    if (/^(\d+)\s+(\d+)\s+(\S+)/) {
        my $CDS1 = $1;
        my $CDS2 = $2;
        my $CDS3 = $3;

        for my $CC22 (@file1list) {
            if ( $CC22 > $CDS1 && $CC22 < $CDS2 ) {
                print OUTPUTFILE "$CC22 $CDS1 $CDS2 $CDS3\n";
            }
        }
    }
}

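(The program also has some style issues, such as uppercase letters in variable names, but I have ignored those; this is a very good program for a beginner.)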
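The end-of-file behaviour described in this answer can be reproduced in isolation. The following is a minimal sketch, not part of the original answer: it uses an in-memory filehandle with made-up two-line data and a hypothetical helper named `count_remaining_lines`. It also shows that `seek` rewinds the handle, which would be an alternative to storing FILE1 in an array, at the cost of re-reading FILE2 for every position.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory data standing in for FILE2.
my $ranges_text = "517 1878 forward\n700 2500 forward\n";
open my $fh, '<', \$ranges_text or die "cannot open in-memory file: $!";

# Hypothetical helper: count the lines a read loop can still get from the handle.
sub count_remaining_lines {
    my ($handle) = @_;
    my $count = 0;
    while ( my $line = <$handle> ) {
        $count++;
    }
    return $count;
}

my $first_pass  = count_remaining_lines($fh);    # reads every line
my $second_pass = count_remaining_lines($fh);    # handle is already at EOF

seek $fh, 0, 0 or die "cannot seek: $!";         # rewind to the start
my $after_seek = count_remaining_lines($fh);

print "first pass: $first_pass, second pass: $second_pass, after seek: $after_seek\n";
```

The second pass reads nothing, which is exactly what happens to the inner `while (<FILE2>)` loop in the question for every position after the first.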

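Answer 1: (score: 0)

I thought I could simplify some of this by using split instead of a regex, but I think my code actually ended up longer and harder to read! Either way, keep in mind that split is a good fit for problems like this: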

use strict;
use warnings;

# User config area
my $positions_file = 'input_positions.txt';
my $ranges_file    = 'input_ranges.txt';
my $output_file    = 'output_data.txt';

# Reading data
open my $positions_fh, "<", $positions_file or die "cannot open $positions_file: $!";
open my $ranges_fh,    "<", $ranges_file    or die "cannot open $ranges_file: $!";
chomp( my @positions = <$positions_fh> );
# Store the range data in an array containing hash tables
my @range_data;
# to be used like $range_data[0] = {start => $start, end => $end, dir => $dir}
while (<$ranges_fh>) {
    chomp;
    my ( $start, $end, $dir ) = split;    #splits $_ according to whitespace
    push @range_data, { start => $start, end => $end, dir => $dir };
    #print "start: $start, end: $end, direction: $dir\n";
}    #/while
close $positions_fh;
close $ranges_fh;

# Data processing:
open my $output_fh, ">", $output_file or die "cannot open $output_file: $!";
#It feels like it should be more efficient to process one range at a time for all data points
foreach my $range (@range_data) {    #start one range at a time
                                     #each $range = $range_data[#] = { hash table }
    foreach my $position (@positions) {    #check all positions
        if ( ( $range->{start} <= $position ) and ( $position <= $range->{end} ) ) {
            my $output_string = "$position " . $range->{start} . " " . $range->{end} . " " . $range->{dir} . "\n";
            print $output_fh $output_string;
        }                                  #/if
    }    #/foreach position
}    #/foreach range

close $output_fh;

This code would probably run faster if the data processing were done inside the while loop that reads the range data.
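A minimal sketch of that variation, assuming the sample data from the question held in in-memory filehandles rather than real files (the `@matches` array is an addition for illustration). Each range is checked against all positions as soon as it is read, so the ranges never need to be stored:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data from the question, held in memory; real code would open files instead.
my $positions_text = "99269\n550\n100\n126477\n1700\n";
my $ranges_text    = "517 1878 forward\n700 2500 forward\n2156 3289 forward\n"
                   . "99000 100000 forward\n22000 23000 backward\n";

open my $positions_fh, '<', \$positions_text or die "cannot open positions: $!";
chomp( my @positions = <$positions_fh> );
close $positions_fh;

open my $ranges_fh, '<', \$ranges_text or die "cannot open ranges: $!";

my @matches;    # collected so the result can be inspected as well as printed
while ( my $line = <$ranges_fh> ) {
    my ( $start, $end, $dir ) = split ' ', $line;
    next unless defined $dir;    # skip blank or malformed lines

    for my $position (@positions) {
        # Inclusive bounds, as in the code above.
        if ( $start <= $position and $position <= $end ) {
            push @matches, "$position $start $end $dir";
        }
    }
}
close $ranges_fh;

print "$_\n" for @matches;
```

Because the outer loop runs over ranges, the output is grouped by range rather than by position, and a position that lies in several ranges is printed once per range.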

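Answer 2: (score: 0)

Your bug is caused by nesting the file processing: the inner loop only goes through the contents of its file once, and after that it is stuck at eof.

The simplest solution is to load the inner loop's file completely into memory first.

The following demonstrates this using more Modern Perl techniques: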

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my $cc22file = "/Users/edwardtickle/Documents/CC22positions.txt";
my $cdsfile = "/Users/edwardtickle/Documents/CDSpositions.txt";
my $outfile = "/Users/edwardtickle/Documents/CC22CDS.txt";

my @ranges = do {
    # open my $fh, '<', $cdsfile;   # Using Fake Data instead below
    open my $fh, '<', \ "517 1878 forward\n700 2500 forward\n2156 3289 forward\n99000 100000 forward\n22000 23000 backward\n";
    map {[split]} <$fh>;
};

# open my $infh, '<', $cc22file;   # Using Fake Data instead below
open my $infh, '<', \ "99269\n550\n100\n126477\n1700\n";

# open my $outfh, '>', $outfile;   # Using STDOUT instead below
my $outfh = \*STDOUT;

CC22:
while (my $cc22 = <$infh>) {
    chomp $cc22;

    for my $cds (@ranges) {
        if ($cc22 > $cds->[0] && $cc22 < $cds->[1]) {
            print $outfh "$cc22 @$cds\n";
            next CC22;
        }
    }

    # warn "$cc22 No match found\n";
}

Output:

99269 99000 100000 forward
550 517 1878 forward
1700 517 1878 forward
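Note that `next CC22` stops at the first matching range, which reproduces the expected output in the question. In the sample data, 1700 also lies inside the 700 to 2500 range. If every overlapping range were wanted instead, one possible sketch is the following; `matching_ranges` is a hypothetical helper name, and the strict comparisons are kept from the code above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Ranges from the question as [start, end, direction] triples.
my @ranges = (
    [ 517,   1878,   'forward' ],
    [ 700,   2500,   'forward' ],
    [ 2156,  3289,   'forward' ],
    [ 99000, 100000, 'forward' ],
    [ 22000, 23000,  'backward' ],
);

# Hypothetical helper: return every range that strictly contains $position.
sub matching_ranges {
    my ( $position, $ranges ) = @_;
    return grep { $position > $_->[0] && $position < $_->[1] } @$ranges;
}

for my $position ( 99269, 550, 100, 126477, 1700 ) {
    print "$position @$_\n" for matching_ranges( $position, \@ranges );
}
```

With the sample data this prints the three lines shown above plus a fourth line, `1700 700 2500 forward`.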
