Question

I have a text file containing sorted data, with items separated by newlines. For example:

    ...
    abc123
    abc124
    abd123
    abd124
    abd125
    ...

Now I want to build an index for the data set, which should support (at least):

getStringByIndex(n): return the n-th item of the sorted list;
getIndexByString(s): find s among all items and return its index (or -1 if it is not found);

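I have read about some indexing algorithms such as hashing and B-trees. A B-tree with an extra field holding the size of each child subtree should do well. But since the data set is already sorted, I wonder whether there is a more efficient solution than building a B-tree by inserting all the items into it?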

Answer 1

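Since the data is sorted, you can locate content quickly and efficiently by keeping only a small, sparse subset of the data in memory. For example, let's say we decide to store every N-th element in memory. To initialize your API efficiently, you'll want to compile this sparse list into a separate file on disk, so you don't have to stream through 100GB of data to get it.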

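For each of these terms, you need to save the disk offset, relative to the head of the file, at which the term begins. Then all you have to do is load the sparse list of string/offset pairs into memory, and the implementations of your two requests become straightforward: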

    getStringByIndex(n):
        Get floor(n/N)-th string/offset pair from list
        Seek offset position in index
        Read/Skip n mod N strings, then return the next one

    getIndexByString(s):
        Binary search over sparse list in memory
            Locate lower and upper bound string/offset pairs
        If a string/offset pair is in the i-th position in our sparse list,
            then the string itself is the (N x i)-th string in our index.
            We can use this information to compute the return value
        If the string we want isn't in memory:
            Seek lower-bound offset in index
            Read strings until we:
                a) Find a match
                b) Reach the high-bound offset
                c) Reach a string which is lexicographically greater than the one we are looking for
        Else
            Just return the index for the matching string in our sparse list
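
The pseudocode above could be sketched in Python roughly as follows. This is a minimal illustration rather than a definitive implementation: the function names and the `every_n` parameter are hypothetical, the data file is assumed to be opened in binary mode and sorted in byte order, and saving the sparse list to its own file on disk is left out for brevity.

```python
import bisect
import io


def build_sparse_index(data, every_n):
    """Scan the sorted file once, keeping (string, offset) for every N-th line."""
    keys, offsets = [], []
    data.seek(0)
    count = 0
    while True:
        offset = data.tell()
        line = data.readline()
        if not line:
            break
        if count % every_n == 0:
            keys.append(line.rstrip(b"\n"))
            offsets.append(offset)
        count += 1
    return keys, offsets, count  # count is the total number of items


def get_string_by_index(data, offsets, every_n, total, n):
    if not 0 <= n < total:
        raise IndexError(n)
    # Seek to the nearest sampled item at or before n, then skip forward.
    data.seek(offsets[n // every_n])
    for _ in range(n % every_n):
        data.readline()
    return data.readline().rstrip(b"\n")


def get_index_by_string(data, keys, offsets, every_n, s):
    # Last sampled key that is <= s; its block is the only place s can be.
    block = bisect.bisect_right(keys, s) - 1
    if block < 0:
        return -1  # s sorts before the first item
    data.seek(offsets[block])
    for k in range(every_n):
        line = data.readline()
        if not line:
            break  # end of file
        item = line.rstrip(b"\n")
        if item == s:
            return block * every_n + k
        if item > s:
            break  # passed the position where s would have been
    return -1


# Usage with an in-memory stand-in for the data file:
data = io.BytesIO(b"abc123\nabc124\nabd123\nabd124\nabd125\n")
keys, offsets, total = build_sparse_index(data, 2)
print(get_string_by_index(data, offsets, 2, total, 3))
print(get_index_by_string(data, keys, offsets, 2, b"abd124"))
```

Each lookup here costs one seek plus a sequential read of at most `every_n` lines, which is the trade-off the choice of N controls.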

If the strings in your index are fixed-width, you can make even bigger optimizations.
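
One such optimization, sketched here as an assumption about what fixed-width records allow rather than as the answer's own method: if every record occupies exactly `width` bytes (newline included), the n-th item starts at byte `n * width`, so no offsets need to be stored at all, and a binary search can run directly against the file. The names `read_record` and `get_index_by_string_fixed` are hypothetical.

```python
import io


def read_record(data, width, n):
    # The n-th fixed-width record starts at a directly computable offset.
    data.seek(n * width)
    return data.read(width).rstrip(b"\n")


def get_index_by_string_fixed(data, width, total, s):
    # Standard lower-bound binary search, probing the file instead of a list.
    lo, hi = 0, total
    while lo < hi:
        mid = (lo + hi) // 2
        if read_record(data, width, mid) < s:
            lo = mid + 1
        else:
            hi = mid
    if lo < total and read_record(data, width, lo) == s:
        return lo
    return -1


# Usage: five records of 7 bytes each (6 characters plus a newline).
data = io.BytesIO(b"abc123\nabc124\nabd123\nabd124\nabd125\n")
print(read_record(data, 7, 2))
print(get_index_by_string_fixed(data, 7, 5, b"abd125"))
```

Note that every probe of the binary search is a separate seek, so in practice it may still be worth combining this with an in-memory sparse list to narrow the range first.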

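If you implement this algorithm, you'll want to take care in choosing "N". Remember that reading 10 bytes from a position on disk costs not much less than reading 10,000 bytes from the same position: it is the overhead of the disk seek, and of getting into and out of the I/O call, that hurts the most.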
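
As a rough aid for picking N, the two quantities it trades against each other can be estimated with simple arithmetic. The helper below is a hypothetical back-of-the-envelope sketch: the item count and average item size in the usage line are made-up illustrative figures, and it ignores the memory taken by the stored strings themselves.

```python
def sparse_index_tradeoff(total_items, avg_item_bytes, every_n):
    # Number of string/offset pairs that must be held in memory.
    entries_in_memory = -(-total_items // every_n)  # ceiling division
    # Upper bound on bytes read sequentially after the single seek.
    worst_case_scan_bytes = every_n * avg_item_bytes
    return entries_in_memory, worst_case_scan_bytes


# Illustrative figures only: a billion items of about 10 bytes each, N = 1000.
print(sparse_index_tradeoff(1_000_000_000, 10, 1000))
```

Since a 10,000-byte read costs little more than a 10-byte one, a fairly large N tends to shrink the in-memory list considerably without making individual lookups noticeably slower.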

Creating an index on sorted data
