Question

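I am doing it this way to avoid O(n!) time complexity, but for now I only have pseudocode, because there are a few things I am not sure how to implement.

This is the format of the file I want to pass to this script. The data is sorted by the third column, the start position.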

93 Blue19 1 82
87 Green9 1 7912
76 Blue7 2 20690
65 Red4 2 170
...
...
256 Orange50 17515 66740
166 Teal68 72503 123150
228 Green89 72510 114530

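Explanation of the code:

I want to create an array of arrays so that I can find where the spans of two records overlap.

Columns 3 and 4 of the input file are the start and stop positions on a single track line. If any row (x) has a position in column 3 that is smaller than the position in column 4 of any row (y), then x starts before y ends and there is some overlap.

I want to find every row that overlaps with any row without having to compare every row against every other row. Because they are sorted, I simply add a string to the inner array of the array, where each inner array represents one row. If the new row being looked at does not overlap with one of the rows already in the array, then (because the array is sorted by the third column) no later row can overlap with that row in the array, and it can be removed.

This is my idea of it: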

#!/usr/bin/perl -w

use strict;

my @array

while (<>) {

    my thisLoop = ($id, $name, $begin, $end) = split;
    my @innerArray = split; # make an inner array with the current line, to 
                            # have strings that will be printed after it

    push @array(@innerArray)

    for ( @array ) { # loop through the outer array being made to see if there 
                     # are overlaps with the current item

        if ( $begin > $innerArray[3]) # if there are no overlaps then print 
                                      # this inner array and remove it
                                      # (because it is sorted and everything
                                      # else cannot overlap because it is 
                                      # larger)
            # print @array[4-]
            # remove this item from the array
        else
            # add to array this string
            "$id overlap with innerArray[0] \t innerArray[0]: $innerArray[2], $innerArray[3] "\t" $id :  $begin, $end         
            # otherwise because there is overlap add a statement to the inner
            # array explaining the overlap

The code should produce something like:

87 overlap with 93     93: 1 82      87: 1 7912
76 overlap with 93     93: 1 82      76: 2 20690
65 overlap with 93     93: 1 82      65: 2 170
76 overlap with 87     87: 1 7912    76: 2 20690
65 overlap with 87     87: 1 7912    65: 2 170
65 overlap with 76     76: 2 20690   65: 2 170
256 overlap with 76    76: 2 20690   256: 17515 66740
228 overlap with 166   166: 72503 123150   228: 72510 114530

This is hard to explain, so ask me if you have any questions.

Answer 1

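Given your sample data as input, this produces the output you asked for. It runs in under a millisecond.

Do you have other constraints that you haven't explained? Making code run faster should never be an end in itself. There is nothing inherently wrong with O(n!) time complexity: it is the execution time that you must consider, and if your code is fast enough then your job is done.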

use strict;
use warnings 'all';

my @data = map [ split ], grep /\S/, <DATA>;

for my $i1 ( 0 .. $#data ) {

    my $v1 = $data[$i1];

    for my $i2 ( $i1 .. $#data ) {

        my $v2 = $data[$i2];

        next if $v1 == $v2;

        unless ( $v1->[3] < $v2->[2] or $v1->[2] > $v2->[3] ) {
            my $statement = sprintf "%d overlap with %d", $v2->[0], $v1->[0];
            printf "%-22s %d: %d %-7d %d: %d %-7d\n", $statement, @{$v1}[0, 2, 3], @{$v2}[0, 2, 3];

        }
    }
}

__DATA__
93 Blue19 1 82
87 Green9 1 7912
76 Blue7 2 20690
65 Red4 2 170
256 Orange50 17515 66740
166 Teal68 72503 123150
228 Green89 72510 114530

output

87 overlap with 93     93: 1 82      87: 1 7912   
76 overlap with 93     93: 1 82      76: 2 20690  
65 overlap with 93     93: 1 82      65: 2 170    
76 overlap with 87     87: 1 7912    76: 2 20690  
65 overlap with 87     87: 1 7912    65: 2 170    
65 overlap with 76     76: 2 20690   65: 2 170    
256 overlap with 76    76: 2 20690   256: 17515 66740  
228 overlap with 166   166: 72503 123150  228: 72510 114530

Answer 2

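I use the posted input and output as a guide to what is required.

A note on complexity. In principle, each line has to be compared with all the lines that follow it. The number of operations actually carried out depends on the data. Since the data is stated to be sorted on the field being compared, the inner loop can be cut short as soon as the overlaps stop. A comment on complexity estimates is at the end.

Each line is compared with the ones following it. For this, all lines are first read into an array. If the data set is very large, this should be changed to read line by line and the procedure turned around, so that the line just read is compared with all previous ones. This is a very basic approach. It may well be better to build an auxiliary data structure first, possibly with the help of suitable libraries.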
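The line-by-line variant mentioned above could look roughly like the sketch below. It is not part of the original answer: the helper name overlaps_for_row, the @active list, and the inline sample text are made up for illustration, and it assumes the same strict test as the code in this answer (an earlier row overlaps when its end is greater than the current start).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post). $active holds the
# earlier rows, as [id, name, begin, end], that can still overlap a later
# row. Returns the report lines for $row and updates $active in place.
sub overlaps_for_row {
    my ($active, $row) = @_;
    my ($id, undef, $begin, $end) = @$row;

    # Input is sorted by start, so a row that ended at or before this
    # start cannot overlap any later row either: drop it.
    @$active = grep { $_->[3] > $begin } @$active;

    my @report = map {
        sprintf "%s overlap with %s\t%s: %s %s\t%s: %s %s",
            $id, $_->[0], $_->[0], $_->[2], $_->[3], $id, $begin, $end
    } @$active;

    push @$active, $row;    # the current row may overlap later ones
    return @report;
}

# Sample rows from the question, read through an in-memory filehandle to
# stand in for a real input file.
my $sample = join "\n",
    '93 Blue19 1 82',           '87 Green9 1 7912',
    '76 Blue7 2 20690',         '65 Red4 2 170',
    '256 Orange50 17515 66740', '166 Teal68 72503 123150',
    '228 Green89 72510 114530', '';

open my $in, '<', \$sample or die "Can't open in-memory file -- $!";
my @active;
while ( my $line = <$in> ) {
    next unless $line =~ /\S/;
    print "$_\n" for overlaps_for_row( \@active, [ split ' ', $line ] );
}
```

Only the rows that can still overlap are kept in memory. The same pairs are reported as with the nested loop below, but grouped by the row just read rather than by the earlier row, so the order of the lines differs.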

use warnings;
use strict;

my $file = 'data_overlap.txt';
my @lines = do { 
    open my $fh, '<', $file or die "Can't open $file -- $!";
    <$fh>;
};

# For each element compare all following ones, but cut out 
# as soon as there's no overlap since data is sorted
for my $i (0..$#lines) 
{  
    my @ref_fields = split '\s+', $lines[$i];
    for my $j ($i+1..$#lines) 
    {   
        my @curr_fields = split '\s+', $lines[$j]; 
        if ( $ref_fields[-1] > $curr_fields[-2] ) { 
            print "$curr_fields[0] overlap with $ref_fields[0]\t" .
                "$ref_fields[0]: $ref_fields[-2] $ref_fields[-1]\t" .
                "$curr_fields[0]: $curr_fields[-2] $curr_fields[-1]\n";
        }   
        else { print "\tNo overlap, move on.\n"; last }
    }   
}

With the input in the file 'data_overlap.txt' this prints

87 overlap with 93      93: 1 82        87: 1 7912
76 overlap with 93      93: 1 82        76: 2 20690
65 overlap with 93      93: 1 82        65: 2 170
        No overlap, move on.
76 overlap with 87      87: 1 7912      76: 2 20690
65 overlap with 87      87: 1 7912      65: 2 170
        No overlap, move on.
65 overlap with 76      76: 2 20690     65: 2 170
256 overlap with 76     76: 2 20690     256: 17515 66740
        No overlap, move on.
        No overlap, move on.
        No overlap, move on.
228 overlap with 166    166: 72503 123150       228: 72510 114530

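A comment on complexity

Worst case. Each element has to be compared with every other one (they all overlap). This means that for each element we need up to N-1 comparisons, and we have N elements. This is O(N^2) complexity. That is not good for operations that are used often and on potentially large data sets (as library code is). But for one particular problem it is not necessarily bad: the data set still has to be quite large before the runtime becomes prohibitive.

Best case. Each element is compared only once (there are no overlaps at all). This means N comparisons and thus O(N) complexity.

Average. Let us assume that each element overlaps with a "few" of the following ones, say 3 (three). This means 3N comparisons are made. This is still O(N) complexity. It holds as long as the number of comparisons per element does not depend on the length of the list (but is constant), which is a very reasonable typical scenario. This is good.

Thanks to ikegami for raising this in the comments, along with the estimates.

Keep in mind that how much the computational complexity of a technique matters depends on how it is used.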
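To make the worst-case and best-case estimates concrete, here is a small sketch that counts the comparisons made by the same kind of nested loop with an early exit. The helper count_comparisons and the two synthetic data sets are assumptions made for illustration; they are not from the post.

```perl
use strict;
use warnings;

# Hypothetical helper: counts the comparisons made by a nested loop with
# early exit over [begin, end] pairs that are sorted by begin.
sub count_comparisons {
    my @iv    = @_;
    my $count = 0;
    for my $i ( 0 .. $#iv ) {
        for my $j ( $i + 1 .. $#iv ) {
            $count++;
            # Sorted input: once one row fails to overlap, no later row can
            last unless $iv[$i][1] > $iv[$j][0];
        }
    }
    return $count;
}

my $n = 1000;
my @all_overlap = map { [ $_, $n + $_ ] } 1 .. $n;             # every pair overlaps
my @no_overlap  = map { [ 10 * $_, 10 * $_ + 5 ] } 1 .. $n;    # no pair overlaps

printf "all overlap: %d comparisons\n", count_comparisons(@all_overlap);
printf "no overlap:  %d comparisons\n", count_comparisons(@no_overlap);
```

With N = 1000 the first count is N(N-1)/2 = 499500 and the second is N-1 = 999, which matches the quadratic and linear estimates above.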

Making an array of arrays in Perl and removing items from the array
