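I am doing this to avoid O(n!) time complexity, but so far I only have pseudocode, because there are some parts I am not sure how to implement.
This is the format of the file I want to pass to this script. The data is sorted by the third column, the start position.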
93 Blue19 1 82
87 Green9 1 7912
76 Blue7 2 20690
65 Red4 2 170
...
...
256 Orange50 17515 66740
166 Teal68 72503 123150
228 Green89 72510 114530
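Explanation of the code:
I want to create an array of arrays so that I can find where the ranges of two records overlap.
Columns 3 and 4 of the input file are the start and stop positions on a single track line. If any row (x) has a position in column 3 that is smaller than the position in column 4 of any row (y), then x starts before y ends and there is some overlap.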
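The overlap rule above can be written as a small predicate. This is only a sketch: the name `ranges_overlap` is hypothetical, and the strict `<` follows the wording of the question, so ranges that merely touch at one endpoint are treated as not overlapping (switch to `<=` if they should count).

```perl
use strict;
use warnings;

# Hypothetical helper: true when range A (start_a..end_a) and
# range B (start_b..end_b) share some stretch of the track.
# Each range must start before the other one ends.
sub ranges_overlap {
    my ($start_a, $end_a, $start_b, $end_b) = @_;
    return $start_a < $end_b && $start_b < $end_a;
}

# Rows 93 (1..82) and 65 (2..170) from the sample data
print ranges_overlap(1, 82, 2, 170)        ? "overlap\n" : "no overlap\n";
# Rows 65 (2..170) and 256 (17515..66740)
print ranges_overlap(2, 170, 17515, 66740) ? "overlap\n" : "no overlap\n";
```

When the rows are already sorted by start position and y comes before x, the second half of the test is the only one that can fail, which is why the question checks just one comparison.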
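I want to find every row that overlaps with any row, without having to compare every row against every other row. Because the rows are sorted, I only need to add a string to the inner array that represents a row. If the new row being looked at does not overlap with one of the rows already in the array, then (because the input is sorted by the third column) no later row can overlap with that row either, and it can be removed.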
This is what I have in mind:
#!/usr/bin/perl -w
use strict;
my @array;
while (<>) {
    my ($id, $name, $begin, $end) = split;
    my @innerArray = split;    # make an inner array with the current line, to
                               # have strings that will be printed after it
    for my $inner (@array) {   # loop through the outer array being made to see
                               # if there are overlaps with the current item
        if ( $begin > $inner->[3] ) {
            # if there are no overlaps then print this inner array and remove
            # it (because the input is sorted, nothing later can overlap it)
            # print @$inner[4 .. $#$inner]
            # remove this item from the array
        }
        else {
            # otherwise, because there is overlap, add a statement to the inner
            # array explaining the overlap
            push @$inner, "$id overlap with $inner->[0]\t$inner->[0]: $inner->[2], $inner->[3]\t$id: $begin, $end";
        }
    }
    push @array, \@innerArray;
}
The code should produce something like:
87 overlap with 93 93: 1 82 87: 1 7912
76 overlap with 93 93: 1 82 76: 2 20690
65 overlap with 93 93: 1 82 65: 2 170
76 overlap with 87 87: 1 7912 76: 2 20690
65 overlap with 87 87: 1 7912 65: 2 170
65 overlap with 76 76: 2 20690 65: 2 170
256 overlap with 76 76: 2 20690 256: 17515 66740
228 overlap with 166 166: 72503 123150 228: 72510 114530
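This is hard to explain, so ask me if you have any questions.
Answer 0 (score: 1)
Given your sample data as input, this produces the output you asked for. It runs in under a millisecond.
Do you have other constraints that you haven't explained? Making code run faster should never be an end in itself. There is nothing inherently wrong with O(n!) time complexity: it is the execution time that you have to consider, and if your code is fast enough then your job is done.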
use strict;
use warnings 'all';
# Read the non-blank lines and split each into [id, name, start, end]
my @data = map [ split ], grep /\S/, <DATA>;

for my $i1 ( 0 .. $#data ) {

    my $v1 = $data[$i1];

    for my $i2 ( $i1 .. $#data ) {

        my $v2 = $data[$i2];

        # Skip comparing a row with itself
        next if $v1 == $v2;

        # The ranges overlap unless one ends before the other starts
        unless ( $v1->[3] < $v2->[2] or $v1->[2] > $v2->[3] ) {
            my $statement = sprintf "%d overlap with %d", $v2->[0], $v1->[0];
            printf "%-22s %d: %d %-7d %d: %d %-7d\n", $statement, @{$v1}[0, 2, 3], @{$v2}[0, 2, 3];
        }
    }
}

__DATA__
93 Blue19 1 82
87 Green9 1 7912
76 Blue7 2 20690
65 Red4 2 170
256 Orange50 17515 66740
166 Teal68 72503 123150
228 Green89 72510 114530
Output:
87 overlap with 93 93: 1 82 87: 1 7912
76 overlap with 93 93: 1 82 76: 2 20690
65 overlap with 93 93: 1 82 65: 2 170
76 overlap with 87 87: 1 7912 76: 2 20690
65 overlap with 87 87: 1 7912 65: 2 170
65 overlap with 76 76: 2 20690 65: 2 170
256 overlap with 76 76: 2 20690 256: 17515 66740
228 overlap with 166 166: 72503 123150 228: 72510 114530
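Answer 1 (score: 1)
I used the posted input and output as a guide to what is required.
A note on complexity. In principle, each line has to be compared with all the lines that follow it. The number of operations actually performed depends on the data. Since the data is stated to be sorted on the field being compared, the inner loop can be cut short as soon as the overlaps stop. A comment on complexity estimates is at the end.
Each line is compared with the lines that follow it. For this, all lines are first read into an array. If the data set is very large, this should be changed to read line by line, with the procedure turned around so that the line just read is compared with all previous ones. This is a very basic approach. It would likely be better to build an auxiliary data structure first, possibly using a suitable library.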
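The line-by-line variant mentioned above could look roughly like the following. This is a minimal sketch rather than a tested replacement for the code below: the name `sweep_overlaps` and the `@active` list are hypothetical, the rows are passed as an array for brevity (the loop body would work the same inside a `while (<$fh>)` loop), and a strict `>` is assumed for the overlap test.

```perl
use strict;
use warnings;

# Hypothetical helper: takes rows ([id, name, start, end]) sorted by start
# and returns one report string per overlapping pair.
sub sweep_overlaps {
    my @rows = @_;
    my (@active, @report);
    for my $cur (@rows) {
        # Keep only earlier rows that end after the current start. Input is
        # sorted by start, so a row dropped here cannot overlap later rows.
        @active = grep { $_->[3] > $cur->[2] } @active;

        # Every row still active overlaps the current one
        push @report,
            "$cur->[0] overlap with $_->[0]\t$_->[0]: $_->[2] $_->[3]\t$cur->[0]: $cur->[2] $cur->[3]"
            for @active;

        push @active, $cur;
    }
    return @report;
}

my @rows = map { [ split ' ', $_ ] } (
    '93 Blue19 1 82',
    '87 Green9 1 7912',
    '76 Blue7 2 20690',
    '65 Red4 2 170',
    '256 Orange50 17515 66740',
    '166 Teal68 72503 123150',
    '228 Green89 72510 114530',
);
print "$_\n" for sweep_overlaps(@rows);
```

For the sample data this reports the same eight pairs as the posted output, but grouped by the later row of each pair instead of the earlier one. Only the rows that can still overlap are held in memory, although in the worst case (everything overlaps) that is still every row.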
use warnings;
use strict;
my $file = 'data_overlap.txt';
my @lines = do {
    open my $fh, '<', $file or die "Can't open $file -- $!";
    <$fh>;
};

# For each element compare all following ones, but cut out
# as soon as there's no overlap since data is sorted
for my $i (0..$#lines)
{
    my @ref_fields = split '\s+', $lines[$i];
    for my $j ($i+1..$#lines)
    {
        my @curr_fields = split '\s+', $lines[$j];
        if ( $ref_fields[-1] > $curr_fields[-2] ) {
            print "$curr_fields[0] overlap with $ref_fields[0]\t" .
                "$ref_fields[0]: $ref_fields[-2] $ref_fields[-1]\t" .
                "$curr_fields[0]: $curr_fields[-2] $curr_fields[-1]\n";
        }
        else { print "\tNo overlap, move on.\n"; last }
    }
}
With the input in the file 'data_overlap.txt' this prints
87 overlap with 93 93: 1 82 87: 1 7912
76 overlap with 93 93: 1 82 76: 2 20690
65 overlap with 93 93: 1 82 65: 2 170
	No overlap, move on.
76 overlap with 87 87: 1 7912 76: 2 20690
65 overlap with 87 87: 1 7912 65: 2 170
	No overlap, move on.
65 overlap with 76 76: 2 20690 65: 2 170
256 overlap with 76 76: 2 20690 256: 17515 66740
	No overlap, move on.
	No overlap, move on.
	No overlap, move on.
228 overlap with 166 166: 72503 123150 228: 72510 114530
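A comment on complexity
Worst case: each element has to be compared with every other one (they all overlap). This means that for each element we need N-1 comparisons, and we have N elements. This is O(N^2) complexity. That is not good for operations that are used often and on potentially large data sets (as library code is). But for one particular problem it is not necessarily bad: the data set still has to be quite large before the run time becomes prohibitive.
Best case: each element is compared only once (there are no overlaps at all). This means N comparisons, and thus O(N) complexity.
Average: let us assume that each element overlaps with a "few" of the following ones, say 3 (three). This means there will be 3N comparisons. This is still O(N) complexity. It holds as long as the number of comparisons per element does not depend on the length of the list (but is constant), which is a very reasonable typical scenario. This is good.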
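To make these estimates concrete, the comparisons made by the early-exit loop can simply be counted. The sketch below is illustrative only: `count_comparisons` is a hypothetical helper, and the two data sets are made up to hit the best and worst cases (1000 ranges with no overlaps, and 1000 ranges that all overlap).

```perl
use strict;
use warnings;

# Hypothetical helper: counts the pairwise checks made by the early-exit
# loop for a list of [start, end] ranges sorted by start.
sub count_comparisons {
    my @ranges = @_;
    my $count = 0;
    for my $i (0 .. $#ranges) {
        for my $j ($i + 1 .. $#ranges) {
            $count++;
            # Stop at the first following range that starts at or after
            # the end of range $i
            last unless $ranges[$i][1] > $ranges[$j][0];
        }
    }
    return $count;
}

my @disjoint = map { [ 10 * $_, 10 * $_ + 5 ] } 1 .. 1000;   # no overlaps
my @nested   = map { [ $_, 5000 - $_ ] } 1 .. 1000;          # all overlap

printf "disjoint: %d comparisons\n", count_comparisons(@disjoint);   # 999
printf "nested:   %d comparisons\n", count_comparisons(@nested);     # 499500
```

The disjoint set needs N-1 comparisons (the last element has nothing after it), while the nested set needs N(N-1)/2, which matches the O(N) and O(N^2) estimates above.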
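Thanks to ikegami for bringing this up in the comments, along with the estimates.
Keep in mind that how much the computational complexity of a technique matters depends on how it is used.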