Detecting overlapping start and stop positions in my data file and separating them

Time: 2016-06-27 16:27:45

Tags: perl

The script I'm working with is meant to go through a file and detect overlaps, with all overlapping entries being put into a new file.

Column 2 can be thought of as a start position and column 3 as a stop position, so in the example
2 6 10 5
2 9 13 5
3 8 9 5

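the first two rows overlap, because the second row starts at 9 and the first row ends at 10, so they overlap at positions 9-10. The third row does not overlap with either of them, because the number in its first column is 3 rather than 2, and a matching first column is a required criterion.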
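
The rule described above can be written as a small predicate. This is only a sketch of my reading of the example: the name rows_overlap is made up for illustration, rows are assumed to be array references of [group, start, stop, value], and I assume that two rows which merely touch (one stops exactly where the other starts) do not count as overlapping, which matches the strict > comparison used in the script further down.

```perl
use strict;
use warnings;

# Hypothetical helper: true when two rows share the first column
# and their start/stop ranges intersect.
# Each row is an array ref: [group, start, stop, value].
sub rows_overlap {
    my ($a_row, $b_row) = @_;
    # Rows with a different first column never overlap.
    return 0 unless $a_row->[0] eq $b_row->[0];
    # Assumption: strict comparison, so touching endpoints do not overlap.
    return ($a_row->[1] < $b_row->[2] && $b_row->[1] < $a_row->[2]) ? 1 : 0;
}

print rows_overlap([2, 6, 10, 5], [2, 9, 13, 5]), "\n";   # rows 1 and 2: prints 1
print rows_overlap([2, 9, 13, 5], [3, 8, 9, 5]), "\n";    # rows 2 and 3: prints 0
```

The first column is compared with eq rather than == because it can hold non-numeric labels such as S.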

Now that you understand what I mean by an overlap:

The input in ARGV[0] is a file like
9   9000000 14100000    23
9   9000000 32800000    4
9   9000000 40200000    6
9   9000000 42400000    5
9   89600000    116700000   28
9   89600000    129300000   8
9   89600000    140273252   52
S   0   24900000    2
S   0   24900000    22
S   0   37500000    2
S   40000000 45000000 7
S   42500000 47000000 9

Given this file, ARGV[1] should end up containing

9   9000000 14100000    23
9   89600000    116700000   28
S   0   24900000    2
S   40000000 45000000 7

and this (which I will later call z.txt) should be sent to standard output

9   9000000 32800000    4
9   9000000 40200000    6
9   9000000 42400000    5
9   89600000    129300000   8
9   89600000    140273252   52
S   0   24900000    22
S   0   37500000    2
S   42500000 47000000 9

ARGV[3] is essentially the output of wc -l file | awk '{print $1}', where file is ARGV[0]

Here is the code

#!/usr/bin/perl
# ARGV[0] is the name of the file which data will be read from(may have overlaps)
# ARGV[1] is the name of the file which will be produced that will have no overlaps
# ARGV[2] is the name of a directory
# ARGV[3] is the number of lines that ARGV[0] will contain
#The purpose of this script is to look through the data file and if there are overlaps then another layer is created
use warnings;
#use strict;

#Here I am just trying to open up my file in order to read from it
my $file = "./$ARGV[0]";
my @lines = do {
    open my $fh, '<', $file or die "Can't open $file -- $!";
    <$fh>;
};

#Here I am assigning a second file that will contain the lines with no overlaps
my $file2 = "./$ARGV[2]/$ARGV[1]";
open(my $fh, ">", "$file2")
        or die "Can't open > $file2: $!";

# For each element compare all following ones, but cut out
# as soon as there's no overlap since data is sorted
my $i = 0;
while ($i < $ARGV[3]) {
        my @ref_fields = split('\s+', $lines[$i]);
#This line is printed to the file handle because it shouldn't have any overlaps, so every line in this file will not overlap with any other
        print $fh "$ref_fields[0]", "\t", $ref_fields[1], "\t", $ref_fields[2], "\t", $ref_fields[3], "\n";
#The script then looks at the lines following the line just looked at
        for my $j ($i+1..$ARGV[3]) {
                my @curr_fields = split /\s+/, $lines[$j];
#if the line does overlap then print it to standard output
                if ( $ref_fields[2] > $curr_fields[1] ) {
                        print $curr_fields[0], "\t", $curr_fields[1], "\t", $curr_fields[2], "\t", $curr_fields[3], "\n";
                }
                else {
#if it doesn't, since all the file is sorted the overlaps are done with
                        $i=$j;
                        last;
                }
        }
        $i++;
}

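Once this script has finished, the standard output can be saved to a file and the script run on it again, so that the overlaps can be separated out once more.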
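
The repeated runs can also be sketched inside a single program. This is a rough illustration rather than the original script: split_once and build_layers are hypothetical names, the rows are held in memory instead of files, and I am assuming that a row only counts as overlapping when its first column matches the current reference row.

```perl
use strict;
use warnings;

# Hypothetical helper: one pass over sorted rows (array refs of
# [group, start, stop, value]). Returns (\@kept, \@overlapping).
sub split_once {
    my @rows = @_;
    my (@kept, @overlapping, $ref);
    for my $row (@rows) {
        if (defined $ref && $ref->[0] eq $row->[0] && $ref->[2] > $row->[1]) {
            push @overlapping, $row;    # collides with the current reference row
        }
        else {
            push @kept, $row;           # becomes the new reference row
            $ref = $row;
        }
    }
    return (\@kept, \@overlapping);
}

# Hypothetical driver: keep peeling off layers until nothing is left over.
# Each pass keeps at least its first row, so the loop always terminates.
sub build_layers {
    my @rows = @_;
    my @layers;
    while (@rows) {
        my ($kept, $overlapping) = split_once(@rows);
        push @layers, $kept;
        @rows = @$overlapping;
    }
    return @layers;
}

my @rows = map { [ split ' ', $_ ] } (
    '9 9000000 14100000 23',  '9 9000000 32800000 4',
    '9 9000000 40200000 6',   '9 9000000 42400000 5',
    '9 89600000 116700000 28', '9 89600000 129300000 8',
    '9 89600000 140273252 52', 'S 0 24900000 2',
    'S 0 24900000 22',        'S 0 37500000 2',
    'S 40000000 45000000 7',  'S 42500000 47000000 9',
);

my @layers = build_layers(@rows);
for my $n (0 .. $#layers) {
    print "layer ", $n + 1, ": ", scalar @{ $layers[$n] }, " rows\n";
}
```

On the sample input from this question, this yields four layers of 4, 4, 3 and 1 rows; the first two layers contain the same rows as the ARGV[1] outputs shown in the examples.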

Unfortunately, my original ARGV[0] has about 1300 lines, and the script somehow produces about 6000 lines on standard output.

Sorry if this is confusing; it is a tricky concept for me, so feel free to ask if you have any questions.

Thank you

Additional example

If this code is run again on z.txt, it should print this to ARGV[1]

9   9000000 32800000    4
9   89600000    129300000   8
S   0   24900000    22
S   42500000 47000000 9

and standard output should print

9   9000000 40200000    6
9   9000000 42400000    5
9   89600000    140273252   52
S   0   37500000    2

1 Answer:

Answer 0: (score: 0)

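First, there is a bug in this assignment: $i=$j; this way, once a run of overlaps ends, you step past one input line, because the later $i++ moves beyond the line that $j pointed at. Second, the inner loop list should be ($i+1..$ARGV[3]-1), not ($i+1..$ARGV[3]). Third, when the inner loop reaches the end of the file without hitting a non-overlapping line, you need to skip the lines already processed; otherwise you get the same lines over and over again. It can be done like this (note $new_i):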

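while ($i < $ARGV[3]) {
...
    my $new_i=$i;
    for my $j ($i+1..$ARGV[3]-1) {
        ...
        if ( $ref_fields[2] > $curr_fields[1] ) {
            ...
            # remember the last line that was consumed as an overlap
            $new_i=$j;
        }
        else {
            # you don't need this assignment
            # $i=$j;
            last;
        }
    }
    # continue with the first line that was not consumed
    $i=$new_i+1;
}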
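
For reference, here is a self-contained variant of that loop which can be run without any input files. It is a sketch under a few assumptions of mine, not the asker's final script: split_overlaps is a hypothetical name, the line count is taken from the array itself instead of $ARGV[3], results are returned as lists rather than written to a file and standard output, $new_i is only advanced when a line actually overlaps (so the line that ends a run becomes the next reference line), and rows are only treated as overlapping when their first columns match, as the question describes.

```perl
use strict;
use warnings;

# Hypothetical wrapper: takes sorted input lines and returns
# (\@kept, \@overlapping) instead of writing to files.
sub split_overlaps {
    my @lines = @_;
    my (@kept, @overlapping);
    my $count = scalar @lines;          # stands in for $ARGV[3]
    my $i = 0;
    while ($i < $count) {
        my @ref_fields = split ' ', $lines[$i];
        push @kept, join("\t", @ref_fields);
        my $new_i = $i;                 # last line consumed in this round
        for my $j ($i + 1 .. $count - 1) {
            my @curr_fields = split ' ', $lines[$j];
            # Assumption: rows only overlap when column 1 matches.
            last unless ($ref_fields[0] eq $curr_fields[0]
                      && $ref_fields[2] > $curr_fields[1]);
            push @overlapping, join("\t", @curr_fields);
            $new_i = $j;
        }
        $i = $new_i + 1;                # first line not consumed yet
    }
    return (\@kept, \@overlapping);
}

my ($kept, $overlapping) = split_overlaps(
    "9   9000000 14100000    23",
    "9   9000000 32800000    4",
    "9   89600000    116700000   28",
    "S   0   24900000    2",
    "S   0   24900000    22",
);

print "kept:\n";
print "$_\n" for @$kept;
print "overlapping:\n";
print "$_\n" for @$overlapping;
```

With these five lines, three end up in the kept list and two in the overlapping list, and no line is lost or reported twice.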