Question

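I have a very large dataset (anywhere from 100,000 to 250,000 elements) that I currently store in a vector so that I can search it for a set of words. Given a phrase (e.g. "on", "para"), the function should find every word that starts with the given phrase and push all matches onto a queue.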

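To find the initial word I use a binary search, which seems to work well, but once the initial word is found I get stuck. How should I efficiently iterate before and after that element to find all similar words? The input is sorted alphabetically, so I know every other possible match sits immediately before or after the returned element. I feel there must be a function in <algorithm> that I can take advantage of. Here is part of the relevant code: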

Binary search function:

int search(std::vector<std::string>& dict, std::string in)
{
    //for each element in the input vector
    //find all possible word matches and push onto the queue
    int first=0, last= dict.size() -1;
    while(first <= last)
    {
        int middle = (first+last)/2;
        std::string sub = (dict.at(middle)).substr(0,in.length());
        int comp = in.compare(sub);
        //if comp returns 0(found word matching case)
        if(comp == 0) {
            return middle;
        }
        //if not, take top half
        else if (comp > 0)
            first = middle + 1;
        //else go with the lower half
        else
            last = middle - 1;
    }
    //word not found... return failure
    return -1;
}

In main():

//for each element in our "find word" vector
for (std::size_t i = 0; i < input.size(); i++)
{
    // currently just finds initial word and displays
    int key = search(dictionary, input.at(i));
    if (key == -1)
        continue; // search() found no word starting with this phrase
    std::cout << "search found " << dictionary.at(key) <<
                 " at key location " << key << std::endl;
}

Answer 1

Use std::lower_bound and iterate forward (you could also use std::upper_bound):

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main() {
    typedef std::vector<std::string> Dictionary;
    Dictionary dictionary = {
        "A", "AA", "B", "BB", "C", "CC"
    };
    std::string prefix("B");
    Dictionary::const_iterator pos = std::lower_bound(
        dictionary.begin(),
        dictionary.end(),
        prefix);
    for( ; pos != dictionary.end(); ++pos) {
        if(pos->compare(0, prefix.size(), prefix) == 0) {
            std::cout << "Match: " << *pos << std::endl;
        }
        else break;
    }
    return 0;
}
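
As a variation on the loop above, the end of the matching range can also be found by binary search instead of a linear scan. The following is only a sketch under the assumption that the dictionary is sorted with the default `operator<`; the helper name `prefix_matches` is hypothetical and not part of the original answer. It uses `std::partition_point`, which works here because, from the `lower_bound` position onward, all words that start with the prefix come before all words that do not.

```cpp
#include <algorithm>
#include <queue>
#include <string>
#include <vector>

// Hypothetical helper (not from the original post): collects every word in a
// sorted dictionary that starts with `prefix` and pushes it onto a queue.
std::queue<std::string> prefix_matches(const std::vector<std::string>& dict,
                                       const std::string& prefix)
{
    // First element that is not less than the prefix: O(log n).
    auto first = std::lower_bound(dict.begin(), dict.end(), prefix);

    // First element at or after `first` that does not start with the prefix.
    // In [first, end) the matching words form a leading block, so the range
    // is partitioned and can be binary-searched as well.
    auto last = std::partition_point(first, dict.end(),
        [&prefix](const std::string& word) {
            return word.compare(0, prefix.size(), prefix) == 0;
        });

    // Copy the matching range [first, last) onto the queue.
    std::queue<std::string> matches;
    for (auto it = first; it != last; ++it)
        matches.push(*it);
    return matches;
}
```

With the dictionary from the answer above, `prefix_matches(dictionary, "B")` would return a queue holding "B" followed by "BB". No backward iteration is needed, because `lower_bound` already lands on the first possible match.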

Answer 2

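You need to build an index not only for each phrase but also for every sub-phrase that starts at a word. For example, for the dictionary string "New York" you have to keep index entries for two strings: "New York" and "York". See my autocomplete demo, which illustrates the idea: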

http://olegh.cc.st/autocomplete.html

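As you can see, this subsystem works quickly with a dictionary of up to 250K elements. Of course, I don't use binary search, because it is slow. I use hashing instead.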
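
The answer does not show how the hash index is built, so the following is only one possible sketch and not the answerer's actual code. It assumes words are separated by single spaces, and the names `PrefixIndex`, `build_index` and `lookup` are hypothetical. Every prefix of every word-aligned sub-phrase becomes a key in a `std::unordered_map`, which trades a good deal of memory for constant-time average lookups.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch: maps every prefix of every word-aligned sub-phrase
// to the indices of the phrases that contain it.
using PrefixIndex = std::unordered_map<std::string, std::vector<std::size_t>>;

PrefixIndex build_index(const std::vector<std::string>& phrases)
{
    PrefixIndex index;
    for (std::size_t id = 0; id < phrases.size(); ++id) {
        const std::string& phrase = phrases[id];
        std::size_t start = 0;
        while (start < phrase.size()) {
            // Sub-phrase beginning at a word, e.g. "New York", then "York".
            const std::string sub = phrase.substr(start);
            for (std::size_t len = 1; len <= sub.size(); ++len) {
                std::vector<std::size_t>& ids = index[sub.substr(0, len)];
                // Avoid listing the same phrase twice under one key.
                if (ids.empty() || ids.back() != id)
                    ids.push_back(id);
            }
            // Advance to the start of the next word, if there is one.
            const std::size_t space = phrase.find(' ', start);
            if (space == std::string::npos)
                break;
            start = space + 1;
        }
    }
    return index;
}

// Returns the indices of all phrases with a word-aligned sub-phrase
// that starts with `typed`; empty if there is none.
std::vector<std::size_t> lookup(const PrefixIndex& index, const std::string& typed)
{
    const auto it = index.find(typed);
    return it == index.end() ? std::vector<std::size_t>() : it->second;
}
```

For the phrases "New York", "Yorkshire" and "Newark", looking up "Yor" would return the indices of "New York" and "Yorkshire", while "ork" would return nothing because it does not start at a word.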

Answer 3

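A sorted vector (list) is certainly one way to store the data, but keeping the items ordered has an efficiency cost. You didn't mention whether your array is static or dynamic. There are other data structures that store sorted data and have very good lookup times.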

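Hash/map - you can store the items in a hash/map and get very fast lookups, but finding the next and previous items is problematic.
Binary tree / N-ary tree / B-tree - very good dynamic insert/delete performance and good lookup time, and the tree is ordered, so finding the next/previous item is straightforward.
Bloom filter - sometimes all you want to do is check whether an item is in your collection, and a Bloom filter can have a very low false-positive rate, so it is a good choice for that.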
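
To illustrate the ordered-tree option from the list above, here is a minimal sketch using `std::set`, which is typically implemented as a balanced binary tree. The helper name `words_with_prefix` is hypothetical. Because the set stays sorted across insertions and deletions, the prefix search starts at `lower_bound` and simply walks to the next element until a word no longer matches.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: returns all words in an ordered set that start
// with `prefix`, in alphabetical order.
std::vector<std::string> words_with_prefix(const std::set<std::string>& words,
                                           const std::string& prefix)
{
    std::vector<std::string> matches;
    // The member lower_bound descends the tree in O(log n); each ++it then
    // moves to the next word in sorted order.
    for (auto it = words.lower_bound(prefix);
         it != words.end() && it->compare(0, prefix.size(), prefix) == 0;
         ++it)
        matches.push_back(*it);
    return matches;
}
```

Inserting a new word with `words.insert("onto")` keeps the container ordered without re-sorting, which is the main advantage over a sorted vector when the data is dynamic.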

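Suppose you broke your data into short subsequences (syllables); you could then have a tree of syllables with very fast lookups, and depending on whether the tree is implemented as an ordered list or as a hash/map, you might also be able to find the next/previous item.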
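
The idea of a tree of short subsequences can be sketched as a trie. The version below is a simplification that uses single characters rather than syllables as the unit, purely to keep the example short; the names `TrieNode`, `trie_insert`, `collect` and `starts_with` are hypothetical. Children are kept in a `std::map`, which corresponds to the ordered variant mentioned above, so matches come back in alphabetical order.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical character-level trie node (the answer suggests syllables;
// characters are an assumption made for brevity).
struct TrieNode {
    bool is_word = false;
    // std::map keeps children ordered, so traversal is alphabetical.
    std::map<char, std::unique_ptr<TrieNode>> children;
};

void trie_insert(TrieNode& root, const std::string& word)
{
    TrieNode* node = &root;
    for (char c : word) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child)
            child.reset(new TrieNode());
        node = child.get();
    }
    node->is_word = true;
}

// Appends every word stored at or below `node` to `out`;
// `current` holds the text spelled out so far.
void collect(const TrieNode& node, std::string& current,
             std::vector<std::string>& out)
{
    if (node.is_word)
        out.push_back(current);
    for (const auto& entry : node.children) {
        current.push_back(entry.first);
        collect(*entry.second, current, out);
        current.pop_back();
    }
}

// Returns all stored words that start with `prefix`.
std::vector<std::string> starts_with(const TrieNode& root,
                                     const std::string& prefix)
{
    const TrieNode* node = &root;
    for (char c : prefix) {
        const auto it = node->children.find(c);
        if (it == node->children.end())
            return {};
        node = it->second.get();
    }
    std::vector<std::string> out;
    std::string current = prefix;
    collect(*node, current, out);
    return out;
}
```

Reaching the prefix node costs time proportional to the prefix length, independent of the dictionary size; the remaining cost is proportional to the size of the subtree that holds the matches.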

How to efficiently search before and after an element for a key phrase
