Question

Given a data set of millions of price ranges, we need to find the smallest range that contains a given price.
The following rules apply:

A range can be fully nested (i.e. 1-10 and 5-10 is valid)
A range cannot be partially nested (i.e. 1-10 and 5-15 is invalid)

Example:
Given the following price ranges:

1-100
50-100
100-120
5-10
5-20

Searching for price 7 should return 5-10
Searching for price 100 should return 100-120 (the smallest range containing 100).

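What is the most efficient algorithm/data structure to implement this?
Searching the web, I only found solutions for searching ranges within ranges.
I have been looking at Morton order and the Hilbert curve, but I can't work out how to use them for this case.
Thanks.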

Answer 1

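Since you did not mention this ad hoc algorithm, I'll propose it as a simple answer to your question:

This is a Python function, but it is easy to understand and to convert to another language.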

def min_range(ranges, value):
    # ranges = [(1, 100), (50, 100), (100, 120), (5, 10), (5, 20)]
    # value = 100

    # INIT
    import math
    best_range = None
    best_range_len = math.inf

    # LOOP THROUGH ALL RANGES
    for b, e in ranges:

        # PICK THE SMALLEST
        if b <= value <= e and e - b < best_range_len:
            best_range = (b, e)
            best_range_len = e - b

    print(f'Minimal range containing {value} = {best_range}')

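I believe there are more efficient and more complicated solutions (for example, if you can do some precomputation), but this is the first step you have to take.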
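A compact variant of the same linear scan, returning the result rather than printing it so that it can be reused or checked against faster structures. The name smallest_containing_range is made up for this sketch; the logic is the same O(n) loop as above.

```python
def smallest_containing_range(ranges, value):
    """Return the shortest closed (start, end) range containing value, or None."""
    # Keep only the ranges that contain the value (both ends inclusive).
    containing = [(b, e) for b, e in ranges if b <= value <= e]
    # min() with default=None handles the "no range matches" case.
    return min(containing, key=lambda r: r[1] - r[0], default=None)


ranges = [(1, 100), (50, 100), (100, 120), (5, 10), (5, 20)]
print(smallest_containing_range(ranges, 7))    # (5, 10)
print(smallest_containing_range(ranges, 100))  # (100, 120)
```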

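EDIT: Here is a better solution, probably in O(log(n)), but it is not trivial. It is a tree where each node is an interval and holds a child list of all the strictly non-overlapping intervals contained inside it. Preprocessing is done in O(n log(n)) time, and a query is O(n) in the worst case (when you can't find two ranges that do not overlap) and probably O(log(n)) on average.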

Two classes. The tree, which holds the structure and can be queried:

class tree:
    def __init__(self, ranges):
        # sort the ranges by lowest starting and then greatest ending
        ranges = sorted(ranges, key=lambda i: (i[0], -i[1]))
        # recursive building -> might want to optimize that in python
        self.node = node( (-float('inf'), float('inf')) , ranges)

    def __str__(self):
        return str(self.node)

    def query(self, value):
        # bisect is for binary search
        import bisect
        curr_sol = self.node.inter
        node_list = self.node.child_list

        while True:
            # which of the child ranges can include our value ?
            i = bisect.bisect_left(node_list, (value, float('inf'))) - 1
            # does it actually include the value?
            if i < 0 or i == len(node_list):
                return curr_sol
            if value > node_list[i].inter[1]:
                return curr_sol
            else:
                # if it does then go deeper
                curr_sol = node_list[i].inter
                node_list = node_list[i].child_list

The node, which holds the structure and the information:

class node:
    def __init__(self, inter, ranges):
        # all elements in ranges will be descendant of this node !
        import bisect

        self.inter = inter
        self.child_list = []

        for i, r in enumerate(ranges):
            if len(self.child_list) == 0:
                # append a new child when list is empty
                self.child_list.append(node(r, ranges[i + 1:bisect.bisect_left(ranges, (r[1], r[1] - 1))]))

            else:
                # the current range r is included in a previous range 
                # r is not a child of self but a descendant !
                if r[0] < self.child_list[-1].inter[1]:
                    continue
                # else -> this is a new child
                self.child_list.append(node(r, ranges[i + 1:bisect.bisect_left(ranges, (r[1], r[1] - 1))]))

    def __str__(self):
        # fancy
        return f'{self.inter} : [{", ".join([str(n) for n in self.child_list])}]'

    def __lt__(self, other):
        # this is '<' operator -> for bisect to compare our items
        return self.inter < other

And to test it:

ranges = [(1, 100), (50, 100), (100, 120), (5, 10), (5, 20), (50, 51)]
t = tree(ranges)
print(t)
print(t.query(10))
print(t.query(5))
print(t.query(40))
print(t.query(50))

Answer 2

Preprocessing that generates disjoint intervals
(I call the source segments ranges and the resulting segments intervals)

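For every range border (both start and end), make a tuple: (value, start/end field, range length, id), and put it in an array/list

Sort these tuples by the first field. In case of a tie, put the longer range to the left for starts and to the right for ends.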

Make a stack
Make StartValue variable.
Walk through the list:
     if current tuple contains start:
          if interval is opened:   //we close it
               if current value > StartValue:   //interval is not empty
                    make interval with   //note id remains in stack
                         (start=StartValue, end = current value, id = stack.peek)
                    add interval to result list
          StartValue = current value //we open new interval
          push id from current tuple onto stack
     else:   //end of range
          if current value > StartValue:   //interval is not empty
               make interval with
                    (start=StartValue, end = current value, id = stack.peek)
               add interval to result list
          pop id from stack   //always remove the finished range, even if the interval was empty;
                              //this assumes ends sort before starts at equal values
          if stack is not empty:
               StartValue = current value //we open new interval
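After that we have a sorted list of disjoint intervals, each carrying its start/end values and the id of the source range (note that many intervals might correspond to the same source range), so we can easily use binary search.

If we add the source ranges one by one in nesting order (a nested range after its parent), we can see that each new range generates at most two new intervals, so the total number of intervals is M <= 2*N and the overall complexity is O(N log N + Q * log N), where Q is the number of queries

Edit: Added the "if stack is not empty" section

The result for your example 1-100, 50-100, 100-120, 5-10, 5-20 is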

1-5(0), 5-10(3), 10-20(4), 20-50(0), 50-100(1), 100-120(2)
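The sweep and the binary-search lookup can be sketched in Python roughly as follows. This is only a sketch under assumptions the answer does not spell out: every range has positive length, an end is processed before a start at the same value, and a query that lands exactly on the boundary between two pieces takes whichever neighboring piece belongs to the shorter source range. The names build_intervals and smallest_range_sweep are illustrative.

```python
import bisect


def build_intervals(ranges):
    """Sweep over the endpoints; return disjoint (start, end, range_id) pieces."""
    events = []
    for rid, (b, e) in enumerate(ranges):
        length = e - b
        events.append((e, 0, length, rid))   # ends: before starts, shorter first
        events.append((b, 1, -length, rid))  # starts: longer (outer) range first
    events.sort()

    intervals, stack, start_value = [], [], None
    for value, kind, _, rid in events:
        if stack and value > start_value:
            # The innermost open range (top of the stack) owns this piece.
            intervals.append((start_value, value, stack[-1]))
        if kind == 1:
            stack.append(rid)
        else:
            stack.pop()  # with proper nesting, the top is the range that ends here
        start_value = value
    return intervals


def smallest_range_sweep(ranges, intervals, starts, value):
    """Binary search the pieces; starts is the list of piece start values."""
    i = bisect.bisect_right(starts, value) - 1  # last piece starting <= value
    best = None
    for j in (i - 1, i):  # value may sit on the boundary between two pieces
        if 0 <= j < len(intervals):
            s, e, rid = intervals[j]
            if s <= value <= e:
                b, end = ranges[rid]
                if best is None or end - b < best[1] - best[0]:
                    best = (b, end)
    return best


ranges = [(1, 100), (50, 100), (100, 120), (5, 10), (5, 20)]
intervals = build_intervals(ranges)
starts = [s for s, _, _ in intervals]
print(intervals)
print(smallest_range_sweep(ranges, intervals, starts, 7))    # (5, 10)
print(smallest_range_sweep(ranges, intervals, starts, 100))  # (100, 120)
```

Checking both neighbors at a boundary is a choice made for this sketch, because the ranges are closed at both ends and a boundary value belongs to the pieces on either side of it.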

Answer 3

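Since pLOPeGG has already covered the ad hoc case, I will answer the question under the premise that preprocessing is done in order to support multiple queries efficiently.

The general-purpose data structures for efficient queries on intervals are the Interval Tree and the Segment Tree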
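As a rough illustration of the first option, here is a minimal centered interval tree with a stabbing query, followed by picking the shortest hit. It is a sketch, not a tuned implementation: the names IntervalNode and smallest_by_stabbing are made up, the input is assumed to be a non-empty list of closed (start, end) tuples, and because a stabbing query reports every range containing the point, deeply nested data can still cost O(n) per query.

```python
class IntervalNode:
    """Node of a centered interval tree over closed (start, end) ranges."""

    def __init__(self, ranges):
        points = sorted(p for r in ranges for p in r)
        self.center = points[len(points) // 2]  # median endpoint
        # Ranges that contain the center are stored at this node.
        here = [r for r in ranges if r[0] <= self.center <= r[1]]
        left = [r for r in ranges if r[1] < self.center]
        right = [r for r in ranges if r[0] > self.center]
        self.by_start = sorted(here)                                  # ascending start
        self.by_end = sorted(here, key=lambda r: r[1], reverse=True)  # descending end
        self.left = IntervalNode(left) if left else None
        self.right = IntervalNode(right) if right else None

    def stab(self, x):
        """Yield every stored range that contains x."""
        if x <= self.center:
            # Ranges here end at or after the center, so only the start matters.
            for r in self.by_start:
                if r[0] > x:
                    break
                yield r
            child = self.left if x < self.center else None
        else:
            # Ranges here start at or before the center, so only the end matters.
            for r in self.by_end:
                if r[1] < x:
                    break
                yield r
            child = self.right
        if child is not None:
            yield from child.stab(x)


def smallest_by_stabbing(root, x):
    """Shortest range containing x, or None if no range contains it."""
    return min(root.stab(x), key=lambda r: r[1] - r[0], default=None)


root = IntervalNode([(1, 100), (50, 100), (100, 120), (5, 10), (5, 20)])
print(smallest_by_stabbing(root, 7))    # (5, 10)
print(smallest_by_stabbing(root, 100))  # (100, 120)
```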

Answer 4

What about an approach like this? Since we only allow full nesting and not partial nesting, this looks like a doable approach.

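Split the segments into (left,val) and (right,val) pairs.
Order them by their vals and by their left/right relation.
Search the list with binary search. There are two outcomes: found and not found.
If found, check whether it is a left or a right. If it is a left, go right until you find a right without finding a left. If it is a right, go left until you find a left without finding a right. Pick the smallest.
If not found, stop when high-low is 1 or 0. Then compare the queried value with the value of the node you are at and, depending on that, search to its right and left just like before.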

As an example:

With (l,10) (l,20) (l,30) (r,45) (r,60) (r,100), when searching for, say, 65, you land on (r,100), so you go left and can't find an (l,x) with x>=65, so you keep going left until the lefts and rights balance out; the first right and the last left are your interval. The preprocessing part will be long, but you do it once and keep the list that way. It is still O(n) in the worst case. But that worst case requires everything to be nested inside each other and you searching for the outermost range.

What is the most efficient algorithm/data structure for finding the smallest range containing a point?
