Question

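Earlier today I saw a question (apparently poor enough that it has already been deleted) about removing overlapping intervals (or ranges; I will call them intervals from here on). The question was how to remove intervals that lie completely inside other intervals. For example, given the following: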

1-2
2-3
1-3
2-4

Or, visualized slightly better:

1-2
  2-3
1---3
  2---4

Intervals 1-2 and 2-3 are both removed because they are contained in interval 1-3, so the output would be:

1-3
2-4

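The naive algorithm would probably be to check every interval against every other one, leading to O(n²) comparisons. Someone suggested sorting the source data before processing. Are there other angles from which to tackle this problem?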

The obvious cases (with the data sorted) are:

1-3    remove
1--4

1-3    remove this or next
1-3

1--4
 2-4   remove

1---5
 2-4   remove

1-3    print this, maybe next depending on the one after that
 2-4

If you come up with good pitfalls or other cases in the data, or with additional tags for the question, please add them.
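
The pairwise cases listed above can be written down as a small check. This is only a sketch: the function name `classify_pair` is made up for illustration, and it assumes intervals of the form start-end with non-negative integers, where the first argument sorts before (or equals) the second.

```shell
# classify_pair A B   (hypothetical helper, not part of the question)
# A and B are "start-end" intervals; prints which one, if any,
# lies completely inside the other.
classify_pair() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        split(a, x, "-"); split(b, y, "-")
        # adding 0 forces numeric comparison of the endpoints
        s1 = x[1] + 0; e1 = x[2] + 0
        s2 = y[1] + 0; e2 = y[2] + 0
        if (s2 <= s1 && e1 <= e2)      print "remove first"   # A inside B, or identical
        else if (s1 <= s2 && e2 <= e1) print "remove second"  # B inside A
        else                           print "keep both"      # partial or no overlap
    }'
}

classify_pair 1-3 1-4   # remove first
classify_pair 1-3 1-3   # remove first (either one would do)
classify_pair 1-4 2-4   # remove second
classify_pair 1-5 2-4   # remove second
classify_pair 1-3 2-4   # keep both
```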

Answer 1

This solution expects the data to be sorted before processing, as someone suggested:

$ sort -t- -k1n -k2n file  # playin' it safe
1-2
1-3
2-3
2-4

In awk:

$ cat program.awk
BEGIN { OFS=FS="-" }
{
    if(p=="") {                     # if p is empty, fill it
        p=$0                        # p is the previous record
        next
    }
    split(p,b,"-")                  # p is split to start and end to b[]

    if(b[1] == $1 && b[2] <= $2) {  # since sorting is expected:
        p=$0                        # if starts are equal p line is included or identical
        next                        # so remove it
    }
    else if($2 <= b[2])             # latter is included
        next

    print p                         # no complete overlap, print p 
    p=$0                            # and to the next
}
END { if(p!="") print p }           # print the last kept record (nothing on empty input)

Run it:

$ awk -f program.awk <(sort -t- -k1n -k2n file)
1-3
2-4

Or:

1-2
  2-3
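
A possible variant of the same sorted approach, sketched here under a different sort order than the one used above: if ties on the start are sorted by end in descending order, the awk part only needs to remember the largest end seen so far. The function name `remove_contained` is made up for illustration, and the sketch assumes lines of the form start-end with non-negative integers, start <= end, and no blank lines.

```shell
# remove_contained   (hypothetical helper; reads intervals on stdin)
# Sort by start ascending, then by end DESCENDING, so that for equal
# starts the widest interval comes first.
remove_contained() {
    sort -t- -k1,1n -k2,2nr |
    awk -F- '
        # Every earlier line starts at or before the current one, so the
        # current interval is contained exactly when its end does not
        # exceed the largest end seen so far. Duplicates are dropped too.
        NR == 1 || $2 + 0 > maxend { print; maxend = $2 + 0 }
    '
}

printf '1-2\n2-3\n1-3\n2-4\n' | remove_contained
# 1-3
# 2-4
```

The cost is dominated by the sort, so this stays at O(n log n) comparisons rather than O(n²).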

Answer 2

As long as the algorithm has polynomial complexity, I think a straightforward solution is fine too:

#!/usr/bin/gawk -f

BEGIN {
    FS=OFS="-";
}
{

    arr[NR][1] = $1;
    arr[NR][2] = $2;
}
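END {
    # Note: the traversal order of "for(i in arr)" is unspecified,
    # so the output need not follow the input order.
    for(i in arr) {

        delete_nxt_elem(i);

        if(i in arr)                    # still present: no other interval contains it
            print arr[i][1],arr[i][2];
    }
}

# Delete arr[check_indx] if some other interval fully contains it.
function delete_nxt_elem(check_indx,   j) {

    for(j in arr) {

        if(j==check_indx)
            continue;

        if(arr[j][1]<=arr[check_indx][1] && arr[j][2]>=arr[check_indx][2]) {
            delete arr[check_indx];
            return;                     # stop: reading arr[check_indx] again would re-create it
        }
    }
}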
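
The script above relies on gawk's arrays of arrays (`arr[NR][1]`). For other awks, the same pairwise idea might be sketched with plain arrays as below. The function name `remove_contained_naive` is made up for illustration, and the sketch assumes lines of the form start-end with non-negative integers and no blank lines. It keeps the input order and keeps the first copy of identical intervals.

```shell
# remove_contained_naive   (hypothetical helper; reads intervals on stdin)
remove_contained_naive() {
    awk -F- '
        { s[NR] = $1 + 0; e[NR] = $2 + 0; line[NR] = $0 }  # endpoints stored as numbers
        END {
            for (i = 1; i <= NR; i++) {
                keep = 1
                for (j = 1; j <= NR && keep; j++) {
                    if (j == i) continue
                    covers = (s[j] <= s[i] && e[i] <= e[j])
                    same   = (s[j] == s[i] && e[j] == e[i])
                    # drop i if j covers it; among identical intervals
                    # only the later copies are dropped
                    if (covers && (!same || j < i)) keep = 0
                }
                if (keep) print line[i]
            }
        }
    '
}

printf '1-2\n2-3\n1-3\n2-4\n' | remove_contained_naive
# 1-3
# 2-4
```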

Removing completely overlapping intervals or ranges
